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Abstract. We give a new consistent scoring function for struc¬ 
ture learning of Bayesian networks. In contrast to traditional ap¬ 
proaches to score-based structure learning, such as BDeu or MDL, 
the complexity penalty that we propose is data-dependent and 
is given by the probability that a conditional independence test 
correctly shows that an edge cannot exist. What really distin¬ 
guishes this new scoring function from earlier work is that it has 
the property of becoming computationally easier to maximize as 
the amount of data increases. We prove a polynomial sample com¬ 
plexity result, showing that maximizing this score is guaranteed 
to correctly learn a structure with no false edges and a distribu¬ 
tion close to the generating distribution, whenever there exists a 
Bayesian network which is a perfect map for the data generat¬ 
ing distribution. Although the new score can be used with any 
search algorithm, in |BS13| we have given empirical results show¬ 
ing that it is particularly effective when used together with a linear 
programming relaxation approach to Bayesian network structure 
learning. The present paper contains all details of the proofs of 
the finite-sample complexity results in |BS13| as well as detailed 
explanation of the computation of the certain error probabilities 
called /3-values, whose precomputation and tabulation is necessary 
for the implementation of the algorithm in [BSI3] . 
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CHAPTER 1 


Introduction 


Bayesian networks are graphical representations which encapsulate relations of 
causation and correlation between different observables or hidden factors of a com¬ 
plex system. The purpose of such a representation is to facilitate both conceptual 
understanding and automated probabilistic reasoning (prediction) concerning such 
systems. 

Let X — {Xi,X 2 ,..., Xn) be a collection of random variables representing such 
observables or hidden factors. We will concern ourselves with the case when each 
Xi is discrete, that is takes values in some finite set Val(Xi). For any probability 
distribution P over X we can write the joint probability as a product of conditional 
probabilities 
( 1 ) 

n 

P{Xi — ,..., Xji — — H(Aix — ir 1) Pi^Xif — \ Xi — ,..., Xi/—i — Xj^— i) . 

i /=2 

The factorization Q is called the chain rule. In the most general probabilistic 
model for A, the number of parameters involved is exponential in the number of 
variables n. Since it is common to have n in the hundreds or thousands, this makes 
basic tasks such as parameter estimation impractical. Perhaps more fundamentally, 
such an overly complex model gives no insight into the structural relationships 
between the factors represented by the Xi. In many real-world systems modeled by 
collections of random variables—especially those considered in artificial intelligence 
such as speech recognition, natural language processing, biological and medical 
networks—even when the total number of factors (variables) is large (e.g. in the 
thousands), each factor interacts directly with only a handful of other factors in the 
system. Thus, each factor turns out to be (at least conditionally) independent of 
the vast majority of other factors in the system, and a sparse representation of the 
system is preferable. 

Bayesian network representations of such systems simultaneously capture these 
conditional independence relationships and make the inference tasks we wish to 
carry out more tractable. They do this by eliminating a large proportion of the 
conditioning variables within each the conditional probability distribution factor 
appearing the chain rule Q. 

Formally, we may define a Bayesian network over A as a pair (G, P) where 
G = {V,E) is a directed acyclic graph (DAG), satisfying the following conditions: 
the nodes V correspond to the Xi £ X and the directed edges E correspond to 
direct influence among the nodes; P is a probability distribution over X such that 
all the correlations observed in P are “represented” by the influence relations shown 
in G. We have to give a more precise explanation of the statement about P and G. 
The most simple-minded interpretation of the DAG representation G would be that 
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pairs of random variables not directly connected by an edge in G are independent 
in P. This interpretation is not correct. Rather, the correct interpretation of the 
DAG in terms of conditional independence is given by the so-called local Markov 
property, which we state as follows: 

Local Markov Property of a Bayesian Network: every vari¬ 
able is conditionally independent of its non-descendants given its 
parents. 

It is not difficult to see that the following is equivalent to the local Markov property: 
if Xi,..., Xn represents topological ordering of the Xi with respect to G, we may 
rewrite the chain rule 0. omitting all conditioning factors Xj except those which 
are parents of X,^: 

n 

(2) P(Xi = xi,...,X„ = T„) = 0 P{X, = x,\Xj = Xj for allX, G PaG(A,)). 

V = 1 

Here Pa,a{Xy) denotes the parents of X^, in the DAG G. 

In general, a BN G is called an independence map (or I-map) for a distribu¬ 
tion P if all the (conditional) independence relationships implied by G are present 
in P {i.e., G is consistent with P); in that case case, G is also called a perfect 
map (or P-map) for P if the conditional independence relationships implied by G 
are the only ones present in P (G is consistent with P and no model simpler than 
G is consistent with P). 

We are concerned with the problem of learning the structure of Bayesian net¬ 
works. Ghapters 16-18 of |KF09| give a very general framework for considering 
this problem and an overview of the progress in this already well-studied problem. 
We will not attempt to summarize all of the relevant literature, but here and in 
the comments below concerning relations of our results to the literature, we will 
explain why the problem is both difficult and interesting. The simplest case in 
which the problem of learning the BN structure from data can arise is when we 
have two random variables, which we will call Xa and Xb- It is not difficult to see 
that there are only two nonequivalent BN structures: 

Go '.Xa Xb (“disconnected”), 

Gi :Xa —>■ Xb (“connected”). 

A word about the notion of equivalence of BN structures: changing the direction 
of the arrow changes the DAG Gi, but not the probability distributions which 
can be represented by Gi. This substantiates the statements that all connected 
BN structures on V = {Xa,Xb} are equivalent to the one, Gi, which we have 
depicted. The problem of learning the structure from data in this instance is as 
follows: devise a decision procedure which, given a sequence ujn of N independent 
samples (i.e., observations) of the joint distribution, returns Go in the case that Xa 
and Xb are independent in the generating distribution, and returns Gi in the case 
that Xa and Xb are dependent in the distribution. As in all learning problems, 
we cannot avoid the possibility of error, but seek to bound the error probability by 
some i5 > 0. In other words, in this case, the structure learning problem is identical 
to one case of hypothesis testing, a well-studied and classic problem in statistics. 

Before discussing the methods in any detail we note that even in this simplest 
of cases a number of the subtleties of the situation are evident. First, there are 
two different types of errors: learning the network Gi when the underlying network 
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in Go and learning the network Go when the underlying network is Gi. Note 
that the first situation deserves to be treated differently from the second, because 
whereas the independence relationships implied by Gi (namely, 0) are consistent 
with the model Go, the independence relationships implied by Go {{Xa -U- Xb}) 
are not consistent with the model Gi. In the terminology of Bayesian networks, 
in the case the underlying network is Go and we learn a dependent distribution 
(corresponding to Gi), we have learned an I-map of Go which is not a P-map. This 
is strictly speaking a model which is consistent with the data, but nevertheless 
a model which is more complex than necessary (violating Occam’s Razor). In 
the case the underlying network is Gi and we learn an independent distribution 
(corresponding to Go) we have learned a distribution which is not even an I-map 
of the true network, which is a more clear-cut case of error. In light of these 
consideration we clarify the scope of the problem considered in this paper: 

In this paper, we assume ujn is a sequence of observations of the 
random variables X generated i.i.d. from an unknown Bayesian 
network {G,P), with P a probability distribution over X (no hid¬ 
den variables or unobserved factors). We interpret the problem of 
Bayesian structure learning as recovering from ujn a perfect map 
G for P. 

Thus, learning a model which is an I-map but not a P-map of the underlying 
generating Bayesian network (e.g. Gi in the case the samples are being generated 
from Go) is counted as an error in our framework. Depending on the practical 
application, the experimenter may choose to count this as a less “costly” error than 
the error of learning a network which is not even an I-map. 

So far, by considering only the very simplest case of a two-node network, we 
have avoided confronting a major source of complexity in the BN structure learning 
task. In the two-node case the Bayesian network structure is determined (up to 
equivalence of BN’s) entirely by the skeleton Skel(G) (undirected graph obtained 
by making all edges of the DAG G undirected). Already in the case of three nodes, 
this is no longer true. For example, the following structures, 


G : Xa ^ 

-Xb- 

-A- Xc (“common cause”) 

G' -.Xa- 

^Xb^ 

— Xq (“common effect”) 


share the same (connected) skeleton Xa — Xb — Xq- Yet they are definitely not 
equivalent because in the case of case of the “common cause” network, we have 
Xa )X Xq, but Xa -U- Xc\Xb, whereas for the “common effect” network, the 
situation is exactly reversed, which is to say that Xa )X Xc\Xb but Xa -LL Xq- 
(Intuitively, when i? is a common cause of A and G, observing B decouples, or 
renders independent, the effects A and G in our probability estimates; when B is 
a common effect of independent causes A and G, observing B couples, or renders 
dependent, the causes A and G in our probability estimates). So even if we had an 
“oracle” which handed us the Skel(G), we would still need to perform additional 
(conditional) independence tests to distinguish G from the nonequivalent network 
G'. Some methods break down the structure learning problem into the problem of 
learning the Skel(G), followed by the problem of orienting the edges of the DAG 
correctly, but we do not. Terminologically, the structure seen in G' is called (for 
obvious reasons) a “v-structure”. The possible appearance of “v-structures” at 
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many locations of the DAG G is responsible for much of the complexity of the 
edge-orientation task in this learning scheme. 

By concentrating on these small-n examples, we would not want to give the 
reader the impression that a scalable strategy for learning the structure of Bayesian 
networks is to enumerate all the possible DAG’s on the vertices and to sift through 
them in succession until we find the one best-fitting the data. For one thing, there 
are too many such structures: the number a„ of DAG’s on n vertices is given 
[Rob77] by the recurrence relation 

k=l ^ ' 

and a„ has exponential growth in n. In fact it can be shown l |Chi96j and [Das99j l 
that the problem of BN structure learning is NP-complete, and moreover this re¬ 
mains true even if we restrict the space of BN’s to be considered to those of bounded 
in-degree. Further, the result of a single independence test may constrain the space 
of possible BN’s in a rather complicated way, and the naive way of iterating through 
the structures to look for the right one does not take advantage of this. With a 
more sophisticated strategy one can achieve a running time of much better than 
exponential in most cases (at least if the goal of learning the “correct structure” is 
relaxed in certain ways, as we will describe). 

Structure learning in Bayesian networks is already a well-studied problem, and 
there are two classes of techniques that are frequently used. The first approach is 
to construct a scoring function which assigns a value to every possible structure, 
and then to hnd the structure which maximizes the score |LB94l[HGC9^ . The 
scores measure how well a Bayesian network with a given structure can fit the data. 
The scores also include a complexity penalty which biases toward structures with 
fewer edges. The simplest type of score built out of these general principles is the 
BIG (Bayesian Information Griterion) score: 


(3) ^^,(ccAr,G) =LL(u;^|G)-i/>i(7V)- |G|. 


Here LL{uj]s[\G) is the log-likelihood of the data given G, |G| is the number of 
parameters and the most frequent choice of weighting function is 


^i{N) := 


log Af 
2 


in which case the score is also known as the minimum description length (MDL) 
score. This choice of ipi{N) has a theoretical justification in terms of Bayesian 
probability (see, e.g., p. 802 of [KF09] ). If no assumptions are made about the data 
generating distribution, finding the structure which maximizes the score is known 
to be NP-hard [Chi96l ICHM041 IDas99] . Many heuristic algorithms have been 
proposed for maximizing this score, such as greedy hill-climbing |Chin2L 
and, more recently, by formulating the structure learning problem as an integer 
linear program and solving using branch-and-cut |CusllL[JS~GM10| . 

The second approach is based on statistical hypothesis tests, where one seeks 
conditional independencies that appear to hold true in the observed data. Using 
these, one can constrain the space of Bayesian networks that could have generated 
the data. Such approaches are commonly referred to as constraint-based approaches: 
the results of the tests of conditional independence provide the logical “constraints” 
on the model. If we make the key assumption that each variable has at most a 
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fixed number of parents d, then this approach can yield an algorithm for structure 
learning, see ISGSOIL IPV91| . which runs in polynomial time in the parameter 
n = Card(y). However, this approach has a number of drawbacks: difficulty setting 
thresholds, propagation of errors, and inconsistencies. It is futhermore difficult to 
use this approach together with other prior knowledge about the network structure 
that we are attempting to learn. 

Our approach combines the two main approaches to structure learning of 
Bayesian networks in a novel way. In keeping with the approach via statistical in¬ 
dependence testing, we view the structure learning problem as statistical recovery, 
that of recovering the true underlying structure from samples from its distribution. 
In keeping with the approach via score functions, we incorporate the statistical hy¬ 
pothesis tests into our framework via certain additional terms added onto the BIC 
score (§. The additional terms help the score function rule out incorrect skeletons 
and hone in more quickly on the correct skeleton than the BIC score (see |BS13| 
for experimental evidence). In contrast, the log-likelihood and complexity penalty 
terms from the BIC score act as a Bayesian prior: with small sample size, they 
prevent the sparsity boost from having undue influence, and with large sample size, 
they select a model which is “close” to the true G among those models sharing 
Skel)!?) (the additional terms are not sensitive to the orientation of DAG edges). 


1. Objective function with data-driven sparsity boost 


Our algorithm consists of computing and optimizing the following objective 
function, which is parameterized by two “weighting” functions ijji and ijj 2 

and depends on the graph structure G and the data in the form of a sequence lon 
of independent samples from the underlying distribution: 


^r,.^i.^.(ccAr,G) = LL{ojn\G) - M^) ■ |G| 


In /3^''r(p(wjv,^,-B|s)) 



max mm 

SeSA,BiG) sGval(S) 


Here LL(a;Ar|G) is the maximum log likelihood of observing the data given the graph 
structure G, while |G| is the number of parameters in the model represented by G, 
so that the first two terms of ^ BIC score with function ipi weighting 

the complexity penalty |G|. The third term, which is our novel contribution to the 
scoring function, is perhaps better thought of as a sparsity boost (as opposed to a 
complexity penalty), which rewards G for each nonexistent edge. 

The reward going to an {A, B) ^ E(G) increases as evidence accumulates that 
there is some set S of conditioning variables which renders random variables A and 
B independent. It is well known (Section 3.4.3.1 of |KF09| 1 that among all possible 
parent sets that could render nodes A and B independent, it suffices to consider 
only the two sets PaGA\i? and PaGf3\A. In this paper, we generally, assume that 
in all the networks we consider the number of parents of any node is bounded by 
a constant d. So, if we wish to iterate algorithmically over all the “witnesses” S 
to the possible independence of A and B, there are two natural possibilities: first 
pair of possibilities {PaGA\i?, PaGB\A}, and second the collection of all subsets 
of cardinality d of V^‘^\{A,B}. 

Now conditioning on each possible witness set S on each assignment s G Val(S') 
of the variables in S, we obtain a potentially different sparsity boost: how do we 
produce one single sparsity boost for {A,B)1 On the one hand, the edge {A,B) 
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should not exist in the graph if conditioning on any one of the relevant S renders 
them independent, so we take the maximum sparsity boost over all relevant S. 
One the other hand, for the conditioning set S to render A and B independent, 
all of the conditional joint distributions of A and B, for the various s G Val(S'), 
have to be independent, so for fixed S we take the minimum sparsity boost over all 
s G Ya\{S). 

Intuitively, we can explain the sparsity boost, as follows: r = 

t{uJ]s[) is the (conditional) mutual independence of the two variables calculated on 
the basis of the evidence ojn- With infinite data, if the two variables really were 
independent, r should be zero, that is t{u!n) —)■ 0 as —)■ oo. Suppose we had 
some quantitative notion, 17 > 0 , of the minimal amount of dependence that we 
can expect for any edge that truly exists in the Bayesian network. We want to 
ensure that we do not confuse this minimally dependent edge with independence. 
If the two variables really are independent, then with enough data, we would see 
that r goes to zero whereas the mutual information of the minimally dependent 
edge goes to ij. Even though T(wAr)—^OasfV—)' 00 ,a minimally dependent edge 
can produce an empirical r which is smaller than t{lon)- The distribution is 
a “reference” distribution with certain minimal dependence rj := Since for 

two independent variables t{uJ]s[) —)■ 0 as fV —)■ 00, the event that the reference 
distribution emits a Y/v with t(Yn) < t{ujn) has probability shrinking to zero 
as iV —)■ 00. The quantity 0^ t{ujn) inside the logarithm is precisely the probability 
of this event. In case of true independence of the variables the probability of this 
event should approach 0 and its — In should approach 00 os fV —)■ 00; in the case of 
dependence of the variables of strength at least rj, it should remain bounded away 
from 0 (because of the assumption of a minimal dependence), and its — In should 
remain bounded from above. 

The connection to hypothesis testing may be expressed as follows: the test 
statistic T is a Neyman-Pearson test statistic for the decision between null hypoth¬ 
esis Hq, independence, and the alternative hypothesis Hi. This is because t{ujn) 
is the log of the likelihood ratio of observing ojn under hypothesis Hq versus Hi. 
We compute from T{ujpf) the probability /3^ t{ujn) of Type II error associated with 
the Neyman-Pearson test with threshhold t{ujn). In the objective function, we 
use not T itself, but the CDF of r, considered as a random variable on the set of 
outcomes {wat}. The justification for using the CDF of r, namely, /3^ t{ujn) is that 
it is the probability of accepting Hq under the Neyman-Pearson test with thresh¬ 
old t{ujn), when the data Yn is drawn from a distribution p'^ whose true mutual 
information (namely 77) is unknown to the experimenter but is really 77 (for some 
77 > 0 ). The reason for using t{ojn) is that it is the smallest threshold for which 
the Neyman-Pearson test with that threshold returns Hq as its decision given the 
evidence ujn as input. Therefore, T{ujpf) can also be categorized as the threshold 
of the Neyman-Pearson test with smallest (3 which accepts Hq from the evidence 
lon- Conventionally, in statistics, one denotes the complement of the Type II error 
probability as the power of the hypothesis test, so that we have relationship: 

Power — 1-/3. 

From this we obtain the most powerful explanation of the use of t(ujn). t(u!n) is 
the threshold of the most powerful Neyman-Pearson test which would classify the 
pair A, B, based on the empirical evidence, as independent. As the power increases 



2. DESCRIPTION OF MAIN RESULTS 


15 


to 1, the minus log of its complement—the sparsity boost — ln/3(T(a;Ar)) —increases 
to oo. 

The Type II error /3^ should be contrasted to the Type I error associated 
with the test. In this context, the Type I error is defined as probability a^r(wAr) 
of the Neyman-Pearson test of threshold t{ujn) accepting Hi when the data Y/v 
is drawn from a product distribution pq. We use /3^ instead of because we 
wish to produce a sparsity boost for absent edges, rather than a complexity penalty 
for parameters, in the candidate network. Formally, /3^ t{ujn) is a “Type II error 
probability associated to the most powerful test with Type I error , but 

henceforth we refer to 13^t{u}n) more concisely as a “/3-value”. 


2. Description of main results 

As one of our main results, we obtain finite sample complexity results. The¬ 
orems 1^ and In the PAC learning framework (see e.g. |MRT12] ) a finite 
sample complexity result refers to a formula for a function A^ : (0,1) —^ Z (or in 
some cases an asymptotically equivalent elementary function) such that the algo¬ 
rithm learns the correct model from data ujn with N = N(6) datapoints, with 
probability 1 — i5. There are a few caveats about applying the PAC model to our 
results. First, our finite sample complexity result is about learning not necessarily 
the correct Bayesian network (G, P), but about one “close” to the correct network, 
in a certain sense. Our result is stated in two parts: the first part is about the 
speed at which the algorithm learns a Bayesian network (G', P') for which P' is 
“close” to P in a sense which we will specify below. The second part is about the 
speed at which the algorithm learns a network (G',P') for which G' has no false 
edges, meaning edges of the skeleton of G' which do not appear in the skeleton of 
G. The reason for doing this is that the second part is independent of the problem- 
parameter f, whereas the hrst part is dependent on f. Ultimately we will derive a 
result (Theorem]^ which gives the rate at which the algorithm learns a network 
(G', P') for which Skel(G') C Skel(G) and d(P', P) < C- This result will, of course, 
depend on 

In addition to the parameter f = d(P', P) just mentioned, the result depends 
on certain parameters of the hidden underlying network (G, P). First we need to 
assume that the contingency table representing any conditional joint distribution 
occurring in G, if not a product distribution, has a minimial separation from the 
set of product distributions. Similarly to the practice in [zsns], this minimal 
separation is expressed as a KL-divergence e of the contingency table from the 
product distribution sharing the same marginals. Second, we need to assume that 
certain marginalizations of P stay away from the walls of the probability simplex. 
This separation-from-the-walls assumption is expressed as a pair of integers m, rh. 
First, m is an integer such that the reciprocals of all entries of all contingency 
tables obtained by marginalizing over all possible parent sets of size d are less 
than m. Second, m is an integer such that the reciprocals of all entries of all 
contingency tables obtained by marginalizing over a certain restricted subset of 
the possible parent sets of size d are less than m. This restricted subset only has 
to be large enough that it contains a certificate of independence for each of the 
pairs of vertices not connected by an edge. For a more complete explanation of 
separating sets for independence, see|1.3[ below. For now, we just note that m is in 






2. DESCRIPTION OF MAIN RESULTS 


16 


general smaller than m, potentially substantially smaller. Then there is the error- 
tolerance parameter C which is unlike e & m because it is under the control of the 
experimenter. Finally there are certain hyperparameters: first 77 and k which, while 
constrained to lie in certain intervals, are “free,” in the sense that otherwise they 
are up to the experimenter. Second, there are hyperparameters 0 , 0 , A, 71, which 
can be freely chosen in the unit interval (0,1), and which do not appear in the 
objective function, but only in the formula for N in the finite sample complexity 
result. In order to separate the different variables by role, we write the function N 
in the finite sample complexity, as 

N{e, m, m, n; d; C.; 77; k, 0 , 0 , A, 77). 

Sometimes, for the sake of readability of formulas, we drop the hyperparameters 
(?7, K, 0 , 9 , A, fi) from our notation for N. Then our main theoretical result. Theorem 
[^a) and [6 is a formula for N. As a corollary of the main formula, we readily deduce 
(Theorem ^ (a) and (c)) that 


N{€, m, rh, n; ( 5 ; C) G O 





as e, ( 5 , C —>■ 0“'', and m, m, n —>■ 00. 


for tjji{N) = K\og{N) and 7/72 = 1 and only binary random variables in the networks 
considered. The ipi would be relatively easy to change, but the binary random 
variable assumption seems a bit trickier to remove. We do, however, conjecture 
that a similar result holds for networks with multivalued discrete variables. 

Our other main results are computational. In order to turn the formula for 
the objective function into the basis of a practical algorithm, we have to have 
access to a table of values of / 3 ^ (7), allowing us to look up values for the 77,7 = 
t{ujn) that we compute from the data. Since a table can contain (3 values for 
only finitely many ri,"f,N, our software library uses not only a look-up, but also 
an interpolation algorithm to obtain an accurate approximation for (7). In 
Chapter we describe the essential ideas used in our computation and tabulation 
of (approximations for) a large number of j 3 ^ (7), as well as the interpolation used 
by the algorithm to generate its approximations. We have made publicly available 
both the source code and the tables of / 3 ’s generated by the code in |Brel3| . 


Relation to results in the literature. Our results are limited to the problem of 
learning a Bayesian network without any hidden variables, and from a data series 
without any missing values. In other words, we assume that every point of ujm 
is a Iy [-vector, where V is the number of variables in the generating network. In 
contrast, our results do deal with the issue of generalization error in that Theorems 
[|:a) and [3 a) and (c) guarantee that the learned distribution P' lies ^-close (in KL- 
divergence) to the generating distribution P. 

For the benefit of readers who are familiar with the results on the BIC score 
in the literature, we make a few preliminary comments relating our results to these 
results. There are three “representative” types of results, those of |Hof93| . those 
of |FY96i . and those of fZMD06| (in [DBLOel b Our result is most like that of 
[Hof93] in that we put a bound on the in-degree of the learned network and achieved 
a result which is polynomial in the total number of network variables. This may be 
contrasted to |FY96| where the bound is exponential in the number of variables. 
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and may be contrasted to [ZMD06j in that the finite sample complexity holds for 
the family of competing networks simultaneously rather than for individual com¬ 
petitors. Unlike [H6f93| though, we do recover (with high probability) the correct 
skeleton with the hnite sample. We will discuss these matters further, and also 
make a comparison between our approach and the previous “hybrid” approaches to 
the learning problem in the literature, in Section[T]of Chapter]^ “Discussion”. Also 
in that section, we will briefly discuss existing hybrid approaches such as Relax of 
[FaslO] and MMHC of [TBA06] . and the points of differentiation of our approach 
from their approaches. 
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CHAPTER 2 


Finite Sample Complexity of the Scoring Function 


In the next section we give a concise overview of the notation and definitions 
needed to understand the finite sample complexity result. Also, in the course of 
the proof that follows we will invoke, without detailed explanations, several classic 
results from information theory, statistics, and the Theory of Large Deviations. 
Readers wishing a more leisurely introduction to any of these topics are referred to 
the following works: for general notions concerning Bayesian networks and proba¬ 
bilistic graphical models, the first few chapters of [KF09J : for information theory, 
[CT06] . particularly Chapter 11; for the Theory of Large Deviations, [DZ09] . par¬ 
ticularly Sections 2.1 and 3.6; for hypothesis testing in general. Chapter 9 of [ESI; 
and for hypothesis testing with composite hypotheses [Hoe65] . 

1. Notation and Preliminaries 

In all cases, the problem of learning Bayesian network structure involves ex¬ 
amining a contingency table representing a (possibly conditional) probability dis¬ 
tribution over the (Cartesian) product X = Xa x Xb oi two random variables, Xa 
and Xb- For any of these random variables, say X, we denote by iXj or card(X) 
the cardinality of val(X). In the case when the Bayesian network being learned is a 
structure over discrete variables, which is the case we are examining in this paper, 
all the numbers |Ai| are finite. By convention, we use k to denote the cardinality 
of Xa and I to denote the cardinality oi Xb, and we use V = Vk,i to denote the 
probability simplex, identified with the collection oi k x Z-contingency tables filled 
with nonnegative numbers summing to 1. These quantities are related by 

|Ar| = card(X) = k x I = dim(P) -|-1. 

In the case of two binary random variables, to which we will often restrict ourselves 
in various proofs below, k = I = 2, |X| = 4, dim(P) = 3. 

By p we will denote an element of V, which the standard parameterization of 
V identifies with a set (or contingency table) of |Ai| nonnegative numbers summing 
to 1. Considering a joint distribution p G V as fixed, pa,Pb denote the first and 
second marginal distributions of p, identifiable as vectors whose entries equal the 
sums of the rows (in the case of A, say) and the columns (in the case of B, say) of 
the entries in the contingency table p. The entries of the contingency table, denoted 
Pi,j, i^i^k, l<j<l are the probabilities of the atomic events {i,j) under the 
joint distribution p. Similarly, the entries of the vector pA, denoted pAi, are the 
probabilities of the atomic event i for the marginal distribution pA] similarly for 
the entries pBj oi pb- 

We introduce notation for representing certain distributions of particular inter¬ 
est. For example, Vq denotes the submanifold (of dimension k-\-l — 2) of V consisting 
of probability distributions which are products of their marginal distributions. For 
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PA, Pb as above pq{pa,Pb) denotes the unique product distribution with first mar¬ 
ginal distribution pA and second marginal distribution pb- In the case of binary 
marginal random variables, pA and pb are represented by a single parameter each, 
so we will write this as po{x, y) for x = pa,o, y = Pb,o, both in [0,1]. For x G [0,1], 
by po(x) = po(x, x), we denote the special case of a product distribution with equal 
marginals pA = pB = x. Further, p^ is the uniform distribution; p'^ = po(l/2,1/2) 
in the binary case. 


1.1. Quantifying Edge Strength. One of the most fundamental functions 
to our approach is t, defined as the test statistic t(p) := H{p\\'Pq) of the 
(Hoeffding-)Neyman-Pearson test between null hypothesis Vq and alternative hy¬ 
pothesis V — Vo- 

Definition 1. The mutual information (test statistic) rfp) := H{p\\Vo) is the 
Kullback-Leibler divergence from p to po, where po minimizes the KL-divergence 
H[p\\po), subject to the constraint po G Vq- 


The distribution po is also known as the M-projection of p onto Vo and is known 
(Chapter 8 of |KF09p to be the unique product distribution sharing the same 
marginals as p. While V-y denotes the (disconnected) submanifold of V consisting 
of probability distributions p such that t(p) > 7, denotes the (complementary) 
set oi p GV with r(p) < 7: we have Vy = V — One of the parameters of the 
finite sample complexity, the strength of an edge in the Bayesian network, denoted 
e, is defined below in terms of the test statistic r. 

Frequently we will consider, a distribution p{t) lying on a path in V based 
at p, consisting of a one-parameter family of distributions, all sharing the same 
marginals. Usually, but not always, the parameterization is based at a distribution 
p G Vo, meaning that p(0) G Vo- In the case of binary random variables, there is 
up to reparameterization only one choice of p{t), and we will assume contingency 
tables of the following form: 


p{t) = 


Pm+t 

Pw -1 


Poi - t 
Pii +1 


t G (-min(poo,Pii) ,min(poi,Pio))- 


In cases of fc or Z > 2, there are many possible choices for p{t), which will have to 
be specified depending on the context. For example, one choice is (in the below i 
ranges only from 0 to 2 [|J — 1 , and j ranges only from 0 to 2 [lj — 1 ) 

p{t)=P + {-mm{p,^j\i+ j even}, min {p,j\i+j odd})- 


Once the path p{t) is defined (as it is now for the binary variables case), we 
can define p'', for 77 > 0 a “standard reference” probability distribution having test 
statistic 77 , in the following sense. By definition, 

P^ = P°{t) for the unique t > 0 such that T{p^{f)) = 77 . 

In particular, p^ has uniform marginals. Of course, p^ is defined only for the 
(interval of) realizable values of 77 , depending on k and L Also once we have defined 
a path and fixed a realizable KL-divergence 7, t)(, denotes the unique positive 
parameter for which q'^ = Qoity) for q"^ some other distribution (defined in the 
context) satisfying T{q'^) = 7 and qo G Vo sharing the marginals of q'^- When the 
“standard reference” distribution is defined and it is clear no other path q(t) is 
meant, t^ denotes the unique positive parameter value for which p^ = p°(t+). 
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The test statistic gives us one of the main tools that we need to quantify 
edge strength. The other is separating collections for independence, the subject of 
Section [Ol where we will complete the quantitative definition of edge strength. 

1.2. A /3-value for Independence Tests. We now define the most impor¬ 
tant quantity of all, /3^ ( 7 ), which we call for the sake of concision the /3- value. It 
is more accurate, though not feasible, to call this quantity the probability of type 
II error of the Hoeffding-Neyman-Pearson test with null hypothesis Pq 
and alternative hypothesis p'^, for it is defined as 

Pn (7) := PW-P" < 7}- 

In order to apply these ideas of hypothesis testing to the setting of Bayesian net¬ 
works, we have to recall operations on joint probability distributions that are con¬ 
sidered frequently in all parts of the study of BN’s. In the most general setting 
we are given a Bayesian network (G, P), with G the directed acyclic graph (DAG) 
(y, E). Let S CV, and s G Val(S'), A, B G V — S. An empirical sequence ujn of N 
samples from the network, without any missing data, is a set of |D|-tuples of val¬ 
ues, with the coordinate element in the |17|-tuple corresponding to A (say) coming 
from Val(Ar^). Taking frequency counts of these |t7|-tuples, and normalizing by the 
sample size, we obtain the empirical joint probability distribution p,^^^ (also written 
as p{uin)). Let p be a joint distribution over the \V\ variables in G, for example, 
arising from frequency counts in the manner just described. Then pa,b denotes the 
probability distribution over the pair of variables Xa, Xb, obtained by marginal¬ 
izing out all other variables (if any) in the network. Further, for a joint assignment 
s of the variables S, we use Ps to denote the probability distribution obtained by 
restriction, i.e., conditioning on this joint assignment. Then, starting with joint dis¬ 
tribution p = p,^j^ over all the variables, {pa,b)s, also written as p(A, B\s), denotes 
the probability distribution associated to the contingency table of the variables Xa 
and Xb, conditioned on the joint assignment s to the variables in S. Since the 
multiple subscripts involved in writing all this out are too cumbersome, in practice, 
when we consider several of these operations applied in succession to we write 
them inside parentheses at normal text levels, so that p(uJ]y, A, Bjs) is the joint 
distribution of Xa and Xb derived from taking the normalized frequency counts 
of the empirical sequence cujv, conditioning on s, marginalizing out variables other 
than Xa, Xb- 

1.3. Separating Sets for Independence. In order to formulate the objec¬ 

tive function and the finite sample complexity result, we also must introduce some 
notions related to families of Bayesian networks. Let Q be any subset of all the 
possible Bayesian networks on a set of random variables Xy indexed by a vertex set 
V. We refer to ^ as a family of Bayesian networks. We consider the functions 
Sa.b on Q, indexed by pairs A, B G V, taking values in subsets of the power set of 
V — {A,B}. In other words, using to denote the Cartesian square of the 

vertex set V minus the diagonal, we have 

S = {^A.B : G ^ \A,Bg 

is a mapping that takes a pair of distinct vertices A, B in the subscripted places, 
and an element G G G the remaining place, and outputs a subset of the power set 
oiV — {A, B}. It is usually more convenient to think of the parameters A, B G V 
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{A 7 ^ B) as fixed and consider one mapping, which is denoted in a shorthand 
notation by 

Sa,b ■■ G ^ 

where |y| — 2 is understood to mean V — {A, B}. Note that we informally call such 
a collection of mappings § a collection of eligible separating sets for Q. 

Here is the context for families G of families of BN’s and collections of eligible 
separating sets for Q: suppose that we are given a family Q and ujn generated from 
the underlying, but unknown network G € G and wish to learn G (or a network 
close to G). In particular we often consider Q = consisting of all candidate 
networks such that every vertex has at most d parents, for some positive integer d. 
Let S = {Sa.b I e y^^\A} be a collection of eligible separating sets for Q. 

Common examples of Sa,b, include 

. SA.BiG) = {SC V\{A,B} I 1^1 < d}, 

. SaMG) = {PaG(^)\S, PaG(S)\H}. 

Note that the first Sa.b{G) is actually independent of G. Now we can make the 
following very important definition. 

Definition 2. A collection 

§ = : G ^ \A,B& 

is called a separating collection for G in Q provided that for all {A, B) G H^\A 
we have 


{A, B) ^ G if and only if there exists S G Sa,b(G) such that A Al B\S. 

Here, as usual “{A, B) G G” is shorthand for “{A, B) belongs to the edge set E(G) 
of G”. In this case, for any G' G G, the value Sa,b(G') (which is itself a subset of 
the power set ofV — {A, B}) is called a separating set. 


The intuition behind this definition is that a separating set S separates G 
from competing undirected structures Skel(G') for all G' G in Q. A separating 
collection § for G in Q provides a certificate for statistical recovery of G. We may 
think of the quantity e as the strength of this certificate, or alternatively, as the 

margin of the separating set S for G G S: 


e = e(G,e,S) 


min min max T(p(A,B\s)). 
{A,B)eG seSA.BiQ) seVaKS) 


where 

Sa,b{G) denotes UGee Sa,b{G). 

The smaller e is, the more “noise” there is in the network G, and the more data is 
needed to distinguish G from the networks in Q which differ from G by dropping 
one or more edges from G. Put another way, the reciprocal of e is a measure of the 
“noise” in wg considered as evidence for the (undirected) structure Skel(G). We 
will refer to e as the minimum edge strength of the network G with respect 
to Q and S. The minimum edge strength will be an important factor in the finite 
sample complexity result. 

In addition to Sa,g{G), dehned above, we also have the following derived quan¬ 
tities defined in terms of §. First, we define 


Ss(G) := max |Aa.b(G)|, 
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and 

l^g{G) ■■= max 

(A.B)Gy><2 

which are two related notions of the number of possible separating sets for a given 
edge. We also define 

as(G) := max max IS’I, 

(A.B)gGSgS^,b(G) ' 

which is the maximal size of any separating set in the collection for G. In our sample 
complexity results, we will make the assumption that ag{G) and T,g{Q) are at most 
polynomial in n of degree at most d (typically, ag{G) < d and ^g{G) < ((]))■ 

1.4. New Objective Function. At this point, we have introduced all the 
components making up S',;(G,wat), the score function of the network G on the 
empirical sequence a; at, which we may now define as, 

Sri,ipi,ip2i^N-, G) = LL{uj]\[\G) — tjji{N) ■ |G| 

+ fj 2 {N) V maxggs^ (G) min -In 

sGval(5) 

{A,B)iG ^ ' 

The first line of the definition is the BIG score, and the second line is the sparsity 
boost. It is generally assumed that V'i(-^) oo but that —)■ 0 as N —> oo. 
We mainly consider the case when if 2 {N) = 1, although other choices are possible. 
When s is fixed it is much more convenient to write t(wa/) in place of the more 
correct but cumbersome t{p(ujn, A, B\s)). 

The finite sample complexity result will be stated in terms of the Lambert-W 
function, in particular in terms of the branch W-i, which is uniquely characterized 
as the unique functional inverse of ze^ which takes real, negative values between 
— 1/e and 0 (see [CGH+96| for details). In order to extract a finite sample complex¬ 
ity in terms of more elementary functions, we need to make a further examination 
of the asymptotic behavior of the expressions involving W defined by 

yV{x) := —IT_i(—a;), fora; G ^0, . 

The main thing that the reader has to understand about yV{x) for our purposes 
can be summed up as follows 

as a: G (O, i) approaches 0 from the right, >V(a;) is pretty much 
the same as — ln(a:). 

For the precise statements that are needed in our Theorems, see Lemmas and 
[29] in Section 13 

1.5. Error tolerance. Since, similar to the situation in |FY96| . our finite 
sample complexity results will allow for the learning algorithm to return some 
structure “C-close to” the generating structure (G, P), we have to explain how to 
quantify the distance between BN structures. First, fix C > 0. What we say here 
could in principle be stated for any notion of distance, but we will adopt the most 
common notion of distance between distributions, the KL-divergence, and thus f 
will always denote a KL-divergence between distributions. 

Suppose that X is an arbitrary variable set (e.g. the set corresponding to the 
vertices F) and B is an arbitrary distribution over X (typically B is the under¬ 
lying distribution). In order to specify the notion of a distance of a BN structure 
{G',P') over X from B, the main notion that we need is the distribution pg',b- 


Pnt{p{ujn,A,B\s)) 
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Those familiar with the field may already know po' ,b as the “entropy minimizing” 
distribution among distributions which factor according to G'. Alternatively, pg',b 
is the M-projection of B onto the set of distributions which factor according to G' 
(have G') as an I-map. Those wishing more background and context for pg',b may 
see Appendix [B| The characterization of pg',b as an M-projection is most useful 
for our purposes because it points out that pg'.b is the solution to a constrained 
minimization problem where H(B\p) is the objective being minimized and where 
the constraint is that G' is an I-map for p. Thus, we define the distance of G' from 
B as the value of this minimum, namely H(B\pg',b)- 

In order to derive from this notion of distance a notion of the distance of G' 
from a Bayesian network {G,P), we simply adapt the above definition by setting 
B equal to a distribution such that (G, B) satisfies the faithfulness assumption: in 
other words G is a perfect map (P-map) of B: 

Definition 3. Let {G,B), {G',P') denote a pair of Bayesian networks over a 
variable set X satisfying the faithfulness assumption, that G is a perfect map for 
B and G' is a perfect map for P'. Then the distance of G' from G is defined 
as the divergence of pg',b from B, namely H{B\pg\b)- Fix (j > 0. We say that 
a learning algorithm or decision procedure returns a learning algorithm 
Q-close to {G,B) if it returns a network {G',P') whose distance from {G,B) is 
less than 

The quantity <^, not our e, corresponds to the e of [FY96j . Our e is more 
closely (though somewhat more general) than the minimum “information content” 
considered in [ZMD06] and 

2. Concentration of Entropy and Mutual Information 

Let X(p) be a Bernoulli random variable with parameter p G [0,1]. Let p^ be an 
empirical estimate of the parameter p derived from N samples drawn independently 
from X{p). The Law of Large Numbers and Theory of Large Deviations assure us 
that pn p almost surely as N ^ oo and give us upper bounds on the probability 
of pn being “far” from p for specific values of N. In all parts of the proof below, 
we will need similar, quantitative results concerning the (probable) convergence of 
estimates to various quantities derived from such estimates of p, for example, the 
empirical estimates of the entropy of a (possibly multinomial) distribution to the 
true entropy, and the mutual information between two marginal distributions to 
the true mutual information. Since the basic term appearing in both the entropy 
and mutual information is of the form p\ogp, our first task will be to combine the 
standard results of the Theory of Large Deviations with some analysis to develop 
an upper bound on the probability of pn log pat being “far” from plogp for specific 
values of N. 

The techniques we are now going to apply can be used much more generally 
to transform the estimates from the Theory of Large Deviations into effective esti¬ 
mates concerning the convergence of /(pat) to f{p) for / various specific bounded, 
continuous functions. Naturally, the case we are primarily interested in is the case 
f{x) = X log X so all our efforts will be focused on that case, leading up to Propo¬ 
sition!^ below. 

Lemma 1. Multiplicative form of the Chernojf bounds. Let p be the 

parameter of a Bernoulli random variable X{p). Let be the empirical estimate 
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ofp derived from the frequency counts of a sequence of N samples drawn i.i.d. from 
X{p). 

(1) The probability that p^ exceeds the expected value p is bounded as follows: 

Pr|M > (l + e)| 

(2) The probability that pN falls short of the expected value p is bounded as 
follows: 




P 


„-Npe^/3 


Proof. This is one of the forms of the Chernoff inequality, given for example 
as Theorem 9.2 in |KV94j . □ 


It is very easy to deduce a two-sided version from the one-sided Chernoff bound. 


Lemma 2. With the notation as above we have 


Pr 





For the proof of Lemma from Lemma see Section below. 

Our aim in the following is to prove an effective version of Lemma 1 from 
|Hof93| . which we will achieve in Proposition below. In order to do that, the 
next proposition will bound the probability of large deviations of f(x) = xlogx. 
The main techniques we will use in the proof are the Chernoff bounds (Lemma 
above) and the Mean-value Theorem. Let us note that the principal difficulty in 
obtaining the result is the need to obtain a bound which is independent of p, a 
task which is complicated by the fact that as p —>■ O'*', we have f'{p) —> —oo. In 
Appendix we prove a much simpler “one-sided” bound. Proposition |24[ which 
applies only when p > p. The proof of that proposition makes use of the well-known 
inequality 


(4) log(x) < X — 1. 

(with equality at the point of tangency x = 1); the multiplicative Chernoff bounds; 
and elementary analytic estimates. The following two-sided bound, is the only one 
we use in the sequel. 


Proposition 1. Let e>0,pG [0,1]. Denote by p^ the empirical average of 
a sequence drawn from the Bernoulli distribution X{p). Then we have 

Pr {Ipjv logPAT — plogpj > e} < 3exp A^min 

Note that this bound is independent of p. 

Proof. Let /(x) = xlogx. Our task is to bound the probability under p^ ^ 
X{p) of the event 

{\f{pN) - f{p)\ > e}. 

We consider two cases, first when Pn ^ \p (we use the constant ^ for convenience), 
and second when pn < \p- Tlie reason for breaking up the argument into these 
two cases is that if > ^P, then the derivative of /(x), which is 

/'(x) = Id- log X, 


[3(1 +log 2)2 


exp(IT-i(-e)) 
24’ 12 
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can be bounded, in absolute value, on the interval with endpoints p and p, in terms 
oip, whereas whenpjv < this cannot be done because f'{x) approaches negative 
infinity as cc —> 0^. 

So assuming that pn ^ ^P, note that the interval with endpoints {pn,p} is 
a subinterval of (|,1)- The derivative of f'{x), that is equals which is 

positive on the entire unit interval. Thus, the maximum of the absolute value f'{x) 
on any subinterval of the entire unit interval can occur only at the endpoints. That 
is to say. 


\fip')\ < max||/'(l)|, f (I) I for allp' S (|,1 


The mean value theorem says that there is a p' S (f; l) satisfying the mean value 
property: 

fiP) - f{p) = {PN -p)f{p')- 

Now, take the absolute value of both sides, assume that \f{pN) — f{p)\ > e, and 
solve for jp^v — p\, to obtain the inequality 

I ' I ^ ^ 

^ - max(|/'(l)|, |/'(p/2)|) max(l, |l + logp-log2|) 


Dividing both sides by p, we obtain 


PN 

P 


e 

pmax{l, |1 + logp + log2|} ' 


Thus far we have shown that 


I/(pn) - /(p')l > e 


Pn 


- 1 


> 


pmax{l, |1 + logp + log2|} ’ 


Using Lemma[^ (which applies independently of any assumption pjv < p!) we have 
the estimate 


(5) 

Pr{l/(m) - /(P)l > ., > ip} < 2exp ( 3p,„,„(i_|i~+"lgp + log2P) ) ' 

In order to obtain a bound independent of p, we have to find the global maximum 
of the function appearing in the denominator, namely 

g{x) = xmaxjl, |1 + logo; + log2p} 


on the unit interval. By solving the equation 1 + log x + log 2 = 1, we eliminate the 
max and obtain that g{x) is piecewise defined in terms of elementary functions by 

{ a;(l + log X + log 2 )^ 0 < a; < 

X < X < ^ 

a;(l + loga: + log 2 )^ i < x < 1. 

We calculate 

( (log( 2 ) + log(x) + 1 )^ + 2 log( 2 ) + 2 log(x) + 2 , if 0 < x < 
g'ix) = < 1 , if 56 “^ < X < f . 

[ (log( 2 ) + log(x) + 1 )^ + 2 log( 2 ) + 2 log{x) + 2 , if f < x < 1 


The function (log(2)+log(x) + l)^ + 21og(2) + 21og(x) + 2 has zeros at and 

Of these, only the first is a zero of g'{x). By substituting in values of x from 
the intervals, we see that p'(x) is positive on (O, negative on |e“^). 
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Figure 1. Top left. Graph of / illustrating that if we fix = 

^p, then p = ^ is the unique value in the unit interval at which 
Hpn) = f{p)- Top right. Graph of /' illustrating its monotonicity. 

Bottom left. Graph of g illustrating its continuity and the location 
of its unique global maximum on the unit interval at the right 
endpoint 1. Bottom right. Graph of g' illustrating that its only 
zero is at \e~^. 

and positive on Therefore, by one-variable differential calculus, g{x) is 

increasing on (0, ^e~^), decreasing on (|e“^, ^e~^), and increasing on the intervals 
(ie“^,l). Further, note that whereas g'{x) is not continuous, g{x) is continuous 
(which is clear from the definition of g{x)). The continuity of g{x), combined with 
the preceding observation of the slope of g{x) on the different intervals allows us 
to rule out ^ as a possible local extremum of g{x). As a result, the only possible 
points for the maximum of g{x) on the interval are ^e~^ and 1. The values of g{x) 
at these points are 2e~^ and (1 -|- log 2)^. Of these two values, only (1 -P log 2)^ > 1, 
so the maximum of g{x) on the interval [0,1] is (1 -I- log 2)^, attained only at the 
right endpoint 1. Using this maximum value to bound the denominator of (§ , we 
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obtain 

(6) Pr ||/(:Piv) - /b)| > e & P > < 2exp 

Turning our attention now to the case pn < Chernoff’s bound yields the esti¬ 
mate 

(7) Pr < exp • 

The problem is that the p in the numerator of the exponent Q may render this 
bound useless if p approaches 0. In order to ensure that this does not happen, we 
use the following claim: 

Claim. Assuming that p < ^ and pn ^ ^P, \fip) — f{PN)\ > e 
implies that p > exp(IT_i(—e)), where IT_i is unique real non¬ 
principal branch of the Lambert W function on (—^,0). 

First, note that if we fix pjv = ^P, then p = f is the unique value in the unit 
interval at which /{pn) = f{p)- Next, note that f{x) is decreasing from x = 0 
to X = e~^, where it attains minimum of = —e~^, and thereafter increases 

from X = to x = 1. Consequently, continuing to enforce the equality Pn = ^P 
letting p decrease from f to 0, we have /{pn) > f{p) for all p between 0 and 
Finally, keeping p fixed and letting pN vary in the interval (O, ^p), we also have 
Hpn) > f{p) for allp < i. So if |/(p)-/(pa)| > e, then actually f{pN)-f{p) > 0, 
and so /(p^v) > /(p) + £■ 

Suppose, in order to obtain a contradiction, that p < exp(IF_i(—e)). Then 
logp > VF_i(—e), so that, applying the function ze^ (which is the functional inverse 
of W-i) to both sides, we have plogp > —e. That is, /(p) > —e. Consequently, 
since we are also assuming /(pat) > /(p) -f e, we have /(pat) > 0. But this is 
impossible, since /{pn) < 0 for pat € (0,1). So contrary to assumption, we have 
p > exp(VF_i(—e)), and this completes the proof of the claim. 

From Q, the Claim, and the fact that the possibilities p < \ and p > \ are 
exhaustive, mutually exclusive possibilities, we then obtain 

( 8 ) 

Pr|l/(PAr) - /(p)| > e, &PN < < max ^exp .exp ^_ exp(bF-i^( e))fV ^ 

Now, we have two bounds, both independent of p, for the probability of {pat < jp} 
for the possibilities of pat > |p, (§, and Pn < \p, Since the two conditions on 
Pai, namely pn > \p and pn < ^P, are exhaustive. 


(9) Pr{|/(pAr) -/(p)| > e} < 
-Ne"^ 


2exp 


3(1-flog 2)^ 


-f max exp 


-N 


, exp - 


exp(lF_i(—e))A^ 
12 


an upper bound which is less than or equal to the upper bound given in the Propo¬ 
sition. □ 


We are now able to derive an effective version of the bound in [Hot93| . 


Proposition 2. Let e, 5 > 0 he given. Let 
'3(1 +log 2)2 


6) = max 


,24,. 


12 


exp(VF_i(-e)) 


• log 


3 

S' 
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Then 

PrdpAflogPAT -plogpl >€) < 6 . 

Note that N = log (j)) as e,S ^ 0. 

Proof. Solve for N in the bound of Propositionj^in order to obtain the bound 
on Nn{e,d). In terms of the notation we have introduced concerning the Lambert 
W function, the term involving the Lambert W may be written as By 

exponentiating Lemma we see that as e —>■ 0+, goes to oo at the rate 

O (e“^ (loge“^) log log (e“^)). Note that we actually obtain a tigher asymptotic 
for iV as e, (5 — >■ 0 than that stated in |H6f93j . which states the asymptotic N = 
o((i)^log(i)^log(|)). □ 

Now we are going to use Proposition to estimate from below the quantity 

Pn ( 7 ) for 7 > 77 > 0 . 

The reason that the Proposition applies to estimate this quantity is that, setting 


A = 7 — 7 > 0, 


we have 

/3p’’( 7) = Pry„,^p,{r(yAr) < 7} 

(10) =l-PrY,..p4T(Piv)>7} 

> 1 - PrYv~p'i{|T(yAr) - 77I > A}. 

There is a well-known relation expressing mutual information in terms of entropies 
(see e.g. (2.45), p. 21, of [CT06j L which in our notation says that 

( 11 ) t(p) = H{pa) + H{pb) - H{p) for all p^V. 


In other words, the mutual information of the joint is the sum of the entropies of 
the marginals minus the entropy of the joint. Further, each entropy on the right 
side of ( 11 ) can be expressed as the sum of several terms of the form pi log Pi, for 
Pi an event in some Bernoulli distribution. We can use this observation concerning 
the entropy to prove the following. 


Lemma 3. Let 7 > ry > 0. Let A := 7 — 7 , so that A > 0. Let Juv(A) be the 
function of A defined as follows. 


(12) J^Ar(A) 24exp ( —N min 


A2 


3-64(1 +log2)2’ 24’ 12exp(W(f)) 


Assuming that p^ € 'P 2.2 (i-e.. both of the marginals of p^ are binary random 
variables), we have 


(13) Pry„.p.{|r(Y 7 v) - t{p^)\ > A} < 

Proof. Using © and the comments following that equation, we have that 
t{p^) (resp. T{Yff)) is the sum, with appropriate signs, of a total of 8 terms of the 
form plogp (resp. p^ logpjv)- In order to obtain the desired estimate, we must use 
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the function of Proposition with e = gA, so that we obtain an upper bound on 
the probability of the event, 


Pr<! IpjvlogPAT -plogpl > ^ [■ < 


3exp —TV min 


A 2 


1 exp(fP_i(-A/ 8 )) 


_3 • 64(1+ log2)2’24’ 12 

We must also estimate the probability of 8 events in this way, so from the union 
bound, we have a factor of 24 outside the exponent. □ 

Proposition 3. Let 7 > 77 > 0. Let A := 7 — 77, so that A > 0. Let J^Ar(A) 
be the function of A defined as above. Assuming that S 7 ^ 2,2 (i-s- both of the 
marginals of are binary random variables), we have 

/3^’'(7)>1--^n(A). 


Proof. Continuing from pl)| , which bounds (3 from below in terms of a large 
deviation, we have 


(14) 


> 1 - Pry„~p’j{|T(Pjv) - t{p^)\ > A}. 


In order to lower-bound /3^ ( 7 ), we have to upper bound the probability of the 
empirical mutual information t{Y]\[) deviating from the true value 77 = t{p^) by 
more than A. The Proposition follows immediately from ( [T^ and (14). □ 

At this point we collect a number of useful facts concerning Fn{A) and related 
functions. First, in order to simplify the formulas, make the abbreviation 

Ki = 3-64(l-hlog2)2. 


Next, set 


Further, set 


F{A) = min 


1 


A^ 1 


Ki 24 12 exp W I 


F(A) = min 

Clearly we have the following relation 

F{A) = min 


A 2 


All’ 12W(f) 


—, A(A) 
24’ ^ ’ 


Fn{A) = 24exp(-AF(A)). 
F’(A),.F(A),.F7v(A) >0. 


and, by ( 12 ) 
and the positivities. 

Further we define 

(15) :Fn{A) := 24exp(-AA(A)). 

The point of this definition is that while Fm{A), and consequently Fn{A) is 
constant as a function of A for large A, Fn{A) and Av(A) are monotonic. 

Lemma 4. On R^, TV (A) is increasing and Fn^A) is decreasing. 
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Proof. We have that Fa?(A) is the minimum of two functions and we show 
that these two functions are increasing. The derivatives of the two functions with 
respect to A are 


d_ 




2A 


and 


9Al2expW(f) 8-12 


expW(f) 


> 0 , 


/A 

exp W — 


W' ( ^ 


> 0 . 


The reason for the second expression’s being positive is that W' is decreasing on 
R+, so that its derivative is negative. 

Thus we have that Fn{A) is the minimum of two increasing functions. Further, 
except at the unique point at which these two functions are equal F]\[{A) equals 
one of the two increasing functions, so is increasing. At the unique point at which 
the two functions are equal, Fpf{A), though not differentiable, is still continuous, 
so it is not difficult to see that F]sf{A) is also increasing at that point. From (15) 
Fat (A) is decreasing, □ 

Therefore, Fn has a functional inverse satisfying 


F] 


(f^(A)) = Fat (f^^A)) = a. 


Further, we define the functions 


(16) 

and 


eA(A)=Fjv(A)>0, 


GAiN) = Fn{A) > 0, 

which are the functions F and F with the index and argument swapped. Unlike 
Fat (A), Ga{N) is monotonic (decreasing) because 

G'AiN) = -FiA)-GA{N) <0. 

Therefore, there is a functional inverse t/^^(r) such that 

f;A(f/X'(r)) = ^;z'(^A(r)) = F. 

The functional inverses defined above are decreasing, a fact which we now record. 

Lemma 5. We have all three of Fj^^iV) and ^^^(F), and ^^^(F) decreasing as 
a function ofT. Further, the latter two are given by the formula 

0^i(F) = (log24-logF).[F(A)]-' 


(17) 

and 

0Z'(r) 

(18) 

where 

^X'(r) 

(19) 

and 

[F(A)]-i = 

r ~ n -1 

(20) 

F(A) 


n -1 


F(A) 


§, 24, 12expW 


= max 


Ki 


12 exp>V - 


A 2 


A 


8 
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Proof. For readability, we drop the subscripts N from the proof. All points 
at which is differentiable, the formula for the derivative of a functional inverse 
says that 

= -^< 0 , 




since Tn, by Lemmais decreasing. Further is differentiable at all but one 
point, namely the point Juv(A) for A the point at which the two expressions defining 
are equal. At that point is still continuous. Thus is decreasing at 
every point. 

As for , we derive the formula 0 by solving for N in Ga{N) = F, and 
observe that it is decreasing in F. Similarly for ■ □ 

We will use the following fact a number of times. 

Lemma 6. The bound 

(21) 7 -a(A) > r 

is equivalent to one of the following conditions being true: 

24 ~ 1 

A<24log—, or A<J'^^(r). 


Proof. Solving the bound (21) for F(A), we obtain that it is equivalent to 

^Qcr ^ 


By the definition of F(A), this is equivalent to 


24 


,m) 


< 


log^ 


N 


For the minimum to be less than the quantity on the right side is equivalent to one 
of the two quantities inside the minimum to be less than the quantity on the right 
side. Thus we obtain 

log^ 


24 

A < 24log—, or F{A) < 


24 

r 


N 


The latter alternative is (by multiplying by —A, then exponentiating) easily seen 
to be equivalent to .F(A) > F. We obtain the second condition given above by 
applying to both sides of this inequality and using the fact that is a 
decreasing function. □ 

Lemma [^implies the following: if A > 24 log then J^jv(A) > F implies that 

A<^^'(r). 

Corollary 1. Use the same notation as in Proposition^ Fix F € (0,1), a 
lower bound which we wish to impose on ( 7 ). Assume that 

24 

(22) iv>241og—. 

Then we have the following equivalent statements relating F, 7 , 77 , A. 

(i) . A > .F^^(l — F) implies that /3^ (7 + A) > F. 

(ii) . /3^ (?7 + A) < F implies that A < .F^^(l — F). 
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Proof. The two statements are contrapositives of one another, so we only 
have to prove one of them. We will prove (ii). Suppose that /?^ {ij + A) < P. By 
Proposition with 7 = ry + A, we have 1 — Tm{A) < P, which is to say that 
Tn{A) > 1 — P. Under the condition (22), this implies that A < — P), 

according to Lemma □ 

Now, let pI G V satisfy t(p|) = e. Unlike with we make no assumption that 
the marginals of the distribution pf are uniform. This is very important because 
in the applications p| will be a probability distribution in an unknown generating 
network, so the experimenter/learner can have no control over the marginals of pf. 
We are now going to apply Corollary [T] to estimate the probability of achieving a 
smaller than expected value /3^ where lon is formed from N samples from 

p\. Naturally, because ujn is an empirical sequence, our proposed upper bound 
P € (0,1) on the probability of a large deviation must be a probabilistic one, 
meaning it holds only with a certain probability (ideally as close to 1 as possible). 

Proposition 4. Let e,r] > 0 be given as above so that A := e — 7 > 0. Let 
P G (0,1). Suppose that N, A and P satisfy the conditions 

24 


(23) 

Then 


T]^\1-T)<A, iV>241og 


1 -P 


Pr, 


ujn ~Pl 


{/3P’'(r(w^)) < r} < (a - - P)) 


Proof. Apply Corollary [^ii) with t{uin) = p + A, so that A = t^ujn) — rj- 
Then the corollary says that if the condition N > 24 log holds, then 

{t{ujn)) < P implies t{ujn) - P < - P). 


Using (13), we calculate that, now with A := e — p as in the hypothesis 
{/?A ('r(wiv)) < P} < 

= Pr, 


Pr, 


U)N^V\ 




< Pr € 


< 




|r(a;Ar) - t] < 

|r(tLiAf) < e-A + .Fr^ 
{|r(c.^)-e|>A-J-^^(l-r 
' '(l-P) 


-1(1-P)} 

N (1-r)} 


)} 


A — 


where we have used the assumption, J^^i(l—P) < A, in the form A—^'^^1(1—P) > 0 
to obtain the next-to-last inequality. □ 

The condition in Proposition involves N in an implicit manner, but in most 
circumstances it will be preferable to express the conditions on N in an explicit 
manner. 


Lemma 7. The condition first condition of (23), namely thatiPj^ (1 —P) < A, 


is equivalent to 
(24) 


N > 


F(A) 


n -1 


log 


24 

1-P' 


Further, the conjunction of the conditions found in (23) is equivalent to the condi¬ 
tion 

1 - 1 , 24 


N > [F(A)]-Uog 


1-P' 
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Proof. Because is decreasing, —P) < A is equivalent to < 

1 — r. This is equivalent to Ga^N) < 1 — P. Solving for N and using Lemmawe 
obtain 


N>gz\i-r) 


F(A) 


log 


24 

l-P’ 


For the second statement, (19) and (|20[) together imply that 


[J^(A)]"^ = max(24, F(A)). 


□ 


From Proposition]^ and Lemmawe obtain the following more usable form of 
Proposition 

Corollary 2. Let e > rj > 0, A := e — rj > 0, and F £ (0,1). Then for all N 
large enough, specifically 

1 24 

7V>[F(A)]-'log^^, 

we have the following probable upper bound on (3^ : 

{at! (r(cc^)) < r} < (a - - F)) . 


3. Statement of Finite Sample Complexity 

There are four problem parameters in terms of which we express our finite 
sample complexity result. We have already explained the minimum edge strength e. 
The second is the error probability <5. Next is the minimum separation m of certain 
local probability distributions associated with p. Two versions of this parameter m 
will be introduced in the statements of the Theorems below. We discuss the final 
parameter (f, which is only relevant in the case of more than two random variables, 
before stating the result for n variables. 

Now we come to the hyperparameters, the first of which is k. For the finite 
sample complexity result, we concentrate on the case when the weighting func¬ 
tion 'ipi{N) = nlogN for K > 0 a constant, and 'ip 2 {Af) = 1. In this case we 
refer to more simply as We make this choice because then, with 

the choice k = ^ the initial part of the objective function (without the sparsity 
boost) becomes equal to the MDL objective function considered by, among oth¬ 
ers, [FNP99j . Although [FNP99j considers a number of other complexity-penalty 
weighting functions tpi {tp in their terminology), such as ipi = k (constant) and 
ipi = K,x°‘, 0 < a < 1, we will postpone consideration of these until later work. We 
note here only that our proof of finite sample complexity adapts, in the case n = 2, 
in a straightforward manner to a proof of finite sample complexity in the case of 
constant weighting function, when ipi = k, which contrasts somewhat with the set¬ 
ting of [FNP99] because when the sparsity boost is not present, there is no known 
finite sample complexity result in the case of constant weighting function. Because 
the proof is currently complete only for the case of networks consisting of binary 
variables, we state it only in the case fc = Z = 2, for each node. We believe that 
analogous formulas can be derived for the case of k,l > 2, but leave the technical 
details for future works. In the two-node case, the remaining parameters are the 
hyper parameters 0 € (0,1). The choice of these hyper parameters affects the 
constants, but not the asymptotic growth properties of N{e, (5; rj, k; A; 0, /i). 
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Theorem 1. Let a set of nodes V = {A,B}, with each node representing a 
binary random variable in X ■.= so that k = l = 2 and 

|X| =kl = A. 

Consider the family Q of Bayesian networks {Gq = {G$,Po), Gi = {Ga^b, Pi)}, 
where Gq has no edge between A and B and Gi has an edge with edge strength at 
least e > 0. Let 6 > 0 be a given error probability. Set ipiiN) = Klog(iV) and 
'ip 2 iN) = 1; with K a positive constant. Choose rj G (0, e) so that A := e — ij > 0. 
Choose parameters 0,/i G (0,1). Denote by 


F{A) := min 


A2 


1 


and 


so that 


and 


ATi’ 12expW(f) 

F{A) := min , 


Ki := 3-64(1 +log2)2, 


F(A) 


= max 


Ki /A 

^,12expW - 


[T(A)]-i = max(24, [F{A)]-^). 
Choose A G (0,1) satisfying 

F (/IT?) 

(25) A < -, eguivalently, F{fJ,r]) — Xrj > 0. 

t] 

Let N be larger than all the following quantities. 

(26) -.))]-log f, 


(27) 

(28) 

(29) 


max ([F (A??)] ^, [F(e(l - A))] log 


48 

T 


1 24 

[T(A)]-'log 


1-e’ 


F ( - 
2 


-1 




1 -1 


1 48 

logy, 


1 24 


(30) 


— 

eX \ K 


Then with probability 1 — <5, we have 


S'^(Go,wjv) — Sj^{Gi,ujn) 


>0 ifG = Go 
<0 ifG = Gi 
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In examining this result, one question that may arise is how to choose the 
“free” hyperparameters 0,/r, and in particular, is there obvious value, for example, 
very close to 0 or very close to 1. The answer is that there is no such obviously 
optimal value. In order to see this, consider the case of 0 and the two conditions 
in (28): in one 0 appears in the denominator, and in the other 1 — 0 appears in 
the denominator. Similar comments regarding /r can be made in relation to (26). 
If it were important to determine the optimal values of these hyperparameters, we 
could carry out a numerical optimization procedure to find the optimal value of, 
for example, 0, as a function of A. However, we are primarily interested in the 
asymptotic dependence of N on problem parameters e, 6, C,, rather than the specific 
value of iV. So we merely note that a choice of 0 = ^ | will lead to a specific 

sample size function N = N{e,d]ri, k; X) which has the properties stated in the 
Theorem. 

Here is the corresponding asymptotic statement: 


Theorem 2. Let a set of nodes V = with each node representing a 

binary random variable, so that k = I = 2 and 


|A| =kl = 4:. 


Consider the family Q of Bayesian networks {Go = (G0,Po), Gi = {Ga^b, Pi)}, 
where Go has no edge between A and B and Gi has an edge with edge strength at 
least e > 0. Let 6 > 0 be a given error probability. Set ipiiN) = Klog(iV) and 
ip 2 (,N) = 1 with K a positive constant. Choose rj G (0,e) to be a fixed multiple of e, 
so that A := e — rj is also a fixed multiple of e. Choose parameters Q,pL G (0,1). 
Then there is a function N(e, 6; rj, k; A; 0, p,), 

N{e, S; p, k; A; Q,fi) GO log as e, A 0+. 
such that with probability 1 — A, we have, for N > N{e, A; p, k; A; 0, /i), 


S'^(Go, wjv) 


Sri{Gi,UJN) 



ifG = Go 
ifG = Gi ■ 


Proof. The asymptotic dependence on A as A —>■ O'*' follows from, for example. 


(26) , or any of the other conditions in which the factor log ^ appears. 

The asymptotic dependence on e is determined by (25) and (27). First, note 
that, for the purposes of determining asymptotic dependence as e, and therefore p 
approaches 0, we may replace F{p) with the term p'^. Therefore, (25) says that A 
is of order p^/p = p as p ^ 0^. Therefore, the first element in the maximum in 

(27) is order [F(? 7 ^)] As A —)• 0+, for the purposes of determining asymptotic 


dependence on A, we may replace [F(A)] ^ with A Consequently, [^(ry^)] is 
of order {p^)~'^ = ;^ as ry —)■ 0. Since p and e are of the same order, we reach the 
conclusion that N is of at least order ^ as e —>■ O’*". All the remaining conditions 
are clearly O(^) for A: < 4. □ 


1 -1 


Note that the cause of the asymptotic dependence of A on e being N = O (^) 
is because of the case when G = Gq. The following Theorem shows that if we use 
a different technique, namely Sanov’s Theorem, instead of Chernoff’s Theorem, for 
the estimates in the G = Gq case, the dependence becomes A = O (^). 
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Theorem 3. Let V,k,l,Go,Gi,4’i,4’2,i,S be as in Theorem^ Assume that 
the underlying network in {Go,po), so that u;jv ** sampled from an in¬ 
dependent distribution over a pair of binary random variables. Choose 
V e (0,e). 

(a) Let y,X G (0,1) such that 

F{prj) > Xy. 

Then for any N greater than 


S, K, A, p) := max 


log 24 n 1 48 


we have Sjj{Go,uin) > Sri{Gi,u!N) with probability at least 1 — <5. 

(b) Define 

+ ^25 • 5 ;^ - 2400(|X - ' 


Vn ■— 


48 


Let p G (0,1) be a free parameter. Let the number of sample points be 
larger than 

|X| 


N^{r], S, K, p) := max 


’ ^(96(2r, + l)(|X|-K,{l-n)) )) W I 


’^N 


Vn l|X|exp(^) 


Then with probability 1 — <5, we have Srj{Go,ujN) > Srj{Gi,ujN). 

(c) The functions above have the following asymptotic properties. As N ^ 00 , 

- ^ 

144(27; + 1) ■ 

As c, S —y 0; 


and 


N^{S; 7 ;, k; A; /i) = O ( 4 log 7 1 , 


N^{S-,y,K]p) = 0 ( ^logi 1 • 


Combining the estimates from the Sanov bounds in the independent case Cher- 
noff bounds in the dependent case, we obtain result with the best possible asymp¬ 
totics. 

Theorem 4. Let e > rj > 0 and A := e — rj > 0. Let A,/i, 0 G (0,1) be free 
parameters. Let IV > 0. Define 

(- 4 ^ 1 + 2400(|X - ' 


■ — 


V 


48 


Suppose that N = N(e, S; y; k, A, 0, p) is larger than all of the following quantities 

( |X| 


max 


''( ’ ’'^(96{2r, + l)(|X|-K{l-^)))) 


Vn 


W 


l^loxp(^) 
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24 


F( ^ 


1 48 

log^ 


[F(6(l-A))]-Mog^, 


Then with probability 1 — <5, we have for 

S'^(Go,Wjv) — Sri{Gi,UJN) 


> 0 
< 0 



24 

■0’ 



ifG = Go 
ifG = Gi ■ 


Further, 


N{e,6;r);K,X,e,p,) = O 



as e, S —>■ . 


Here is the finite sample complexity result we obtain for the case of n nodes, 
using only the concentration of mutual information results derived from Chernoff’s 
bounds. 


Theorem 5. Let 5 > 0, Q = , and e > 0 the minimum edge strength of the 

network G with respect to Q and §. Let rj € (0, e) and A := e — rj > 0. Let G G Q‘^ 
be a perfect map for P. Suppose that the separating collection § is defined by 

Sa,b{G) := {All subsets ofV — {A, B} of cardinality d{, 
so that in particular 

a{Q) := max (t(G) is bounded by a constant, i.e., d; 


Ss(e) := 


U Sa,b{G) 


Geg 

(A,B)eV^\A 


is bounded by a polynomial in n of degree d, i.e. 


Let F, F be as defined as in Theorem\^ and let 
NJe.i) = max [ ^(* +‘""f .24.12exp 


(w( 


jl2d+2 


•log 


3-2-^+n,:i)(n-d) 

6{n — {2d + 1)) 


Set the weighting functions tjji{N) := nlogN and ip 2 {N) '.= 1, n a constant, e.g.. 


K = 


(a) Let ( > 0 be a given error parameter. Let 9,Q G (0,1) be free parameters. 
Let 


where 


m = mp{G,G,§)= max mp{{A, B),G,§), 
{A,B)eG 


-1 


mp{{A,B),g,§) := max [^(5 = 5)] 

sGVal(S) 

Take N larger than all of the following quantities: 

.r fC 5\ 
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/ \ 


3 __3|£;(G)|Ss(02"^^^ 

(1 - e)e^ ® 5 


max F ( — log 


1 - 0 ’ 


72\E{G)\^s{g) 


Then with probability 1 — <5, 

Srj,NiG,ujN) > Srj,N{G', ujn), foT all G' such that F[{P,pqi p) > (^. 

(b) Let /i, 0,6* € (0,1) be free parameters, and let L be a positive integer 
L < card(G'\G). Let 

mp(G,§):= max mp(Sp((A,B),G)), 

(A,B)eyx2\G 


S := Sp{{A,B),G) G Sa,b(,G) sueh that A _LL B\S andmp{S) as small as possible. 
For compactness of notation, we abbreviate as follows: 

(32) rh := mp{G,2>), m ■.= mp{G,g,$). 

Let N be larger than all the following quantities: 

(33) max m{l - 9)-^ [F{A)]~^ log N„ (’ 


5\E{G)\J:8{g)2AG) 


-log- 

(1_6I)6»2 <5 




3 , 5GEs(G)2‘"('^) 

jj-^log-J- 


120|i;(G)|I]s(C 


120cts(G)L 

6 




4m log 24 


4mK(|G| — n) 
(1 - 9)LFiyip) 


E(G) 

{l-9)LF{pir))e(\o\-r.). 

4mK(|G| — n) 


Then with probability 1 — <5, 

Sp^n{G,ujn) > Sp^n{G',ojn), for allG' such t/ioi Skel(G') (f. Skel(G). 
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Remark 1. The reader is cautioned not to confuse the following two quanti¬ 
ties which appear in Theorem^ and related discussions: first, |G|, the number of 
parameters of the Bayesian network associated to graph G, and second, \E(G)\, the 
number of edges in the graph G. In particular, the bounds on these quantities for 
G ranging over are different. On the one hand, we have |i?(G)| < nd, and on 
the other hand, we have |G| < , for all G G G‘^■ 


Using an analysis based on Sanov’s Theorem in place of Chernoff’s Theorem, 
we can obtain a refinement of part (b) of Theorem with improved asymptotics 
(in terms of e): 


Theorem 6 . Let 6 > 0 be given. Let G G G‘^ and let m be as given in (32|. 
Let e be the minimum edge strength of G in G,Si, as above. Let rj G (0, e). Let 
0,Q G (0,1) be free parameters and L < card(G'\G) as above. Let N be larger than 
the following quantities 


(40) 


max 


to (1 — 0 ) 


-1 


[T(A)]-'log 


24 

1 - 0 ’ 


Nr,. 


1-1 


V 


rh 800 2?7 + 1 ’ 6 


where Nn{e,5) is the function defined in Theorem^ 


9ffl-0) 


log 


3|£;(G)|Ss(G)2'^(g) 




^(1 \^16(2r; + l)|A|exp(^J^) (;^) 


i/l^l 


Then for 


800m(27? + l)(N(|G|-n)-L|A|) ^^ 


(1-6')L?7 

W 


(1 - 0)Lr, 


E(G) 

_0 (K(lG|-n)-i|X|) 


800m{2r] + 1)((k(|G| — n) — L|A|)) 
we have with probability 1 — 5, 

Sr,,N{G,ujN) > S'„, 7 v(G',W 7 v), for allG' such thatSke\{G') Skel(G). 

We can sum up the asymptotics of all these functions as follows: 

Theorem 7. Let (G,P), G'^,c,rn,iri,m,S, be as in Theorem Let 

N{m,n;5'f;r]',K,Q,9,L,iJ,) be the function of part (a) of Theorem^ Define 
N^(rh,m,n\5',ri]K,9,Q,L) to be the maximum of the functions bounding N from 
below in Theorem^b). Define N^{rh,m,n;5;ri;K,9,Q,L) to be the maximum of 
the functions bounding N from below in Theorem^ Then we can sum up as follows 
the asymptotics of the functions appearing in Theorems^ and^ as 

C, e, d—>-0^, and m,rh^oo. 

(a) From part (a) of Theorem^ 


/ / lo£(77,)77i 

N{m, n; 5; ly; k, 0, 9, L, p) G O I max I - - - , 



' log 
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(b) From part (b) of Theorem 


/ / / ^ \^\ 

N^{m,m,n;6;r];K,Q,0,L,fj.) e O max —, ( ) -log 


£2’ V £2 J 


(c) From Theorem^ 


m / nm 


N^{m,m,n;S;ri]K,Q,9,L) G O I max I —, ( ~ ) 1 ’ j 




Proof. We begin by collecting some elementary observations that will be used 
in the proof of all parts. First, let the function A^„(e, S) defined in Theorem]^ Then, 

(41) Ar„(e,<5) = o(^Q)' 

Second, from the definition of [F(A)1“^ as the maximum of a constant factor times 


O (^) as A —0~^, we have 


A 2^ a constant, and a constant factor times expW (^), which is by Lemma 


28 


logj 


as n 


00, e, 


[F(A)]-i 




as A ^ 0+. 


Since A := e — 77 and we may choose 77 to be a constant multiple (in (0,1)) of e as 
e —>■ 0+, the asymptotics of any function /(A) (or of f{r])) as e —)■ 0+ are the same 
as the asymptotics of /(e) as e —>■ O’*". Therefore, we may actually write the second 
observation as 


(42) 


[F{A)]-\ F(A) ,[F{v)r\ Hv) 



as e —>■ O'*". 


Finally, since we are assuming that Ss(9) is (bounded by) ())), elementary combi¬ 
natorics says that 

(43) Ss(g) = 0(n‘^), logEs(S) = 0(log77), as 00 . 


(a): ^ 

By (411, we have Nn ( 3 ; i) = ^ (^(?) as n —>■ 00 and C, <5 —)■ 0+. 

But N(- ■ ■) the maximum of this function and the other functions listed 
in part (a) of Theorem]^ The last of these functions, once the logarithm is 

suitably expanded, amounts to a constant multiple of m [F'(A)]~^ log ■ 

and by (42) and (43), we have 


^ J V 


as 771, 77. —>■ 00 , e, 5 —>■ O’* 


For the max of these two functions, we obtain the asymptotics of N{- ■ ■) 
claimed in part (a). It is not difficult to go through the other functions in 
the maximum defining N{- ■ ■) and verify that these two functions domi¬ 
nate all the other functions asymptotically. This completes the proof of 
part (a). 
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(b): 


Using (41), (42), and (43), we can see that (33) is 


( 44 ) 


O max 


m /nm \ , 1 

log 7 


, asn, m, 771 —>■ oo, e, 


and that (351 is 


O 


fhlog{n) 


, asn, m —)■ 00 , e 


0 +. 


But the latter function is dominated by (|44|). Furthermore, it is not diffi¬ 
cult to verify that 


through (39), the other functions whose maximum 
defines N'^{m,m,n;6;r];K,Q,9,L,fi) are all dominated by (44). So the 
asymptotics of N^{m,m,n]S;ri; k,Q,9, L, are given by (44). 

(c): 


(45) 


Using ( |41[ ) and ( |42[ ), we can see that ( |40[ ) is 

\ 2 
I \ 


O max 


m / nm 


e J 


log 


1 


, as n, m, 771 —)■ 00 , e,6—>■ 0+. 


It is not difficult to see that the other functions whose maximum is used 
to define N^{m, m, n; S; rj-, k, 0, 9, L) all have asymptotics which are dom¬ 
inated by ( [44| . Thus, the asymptotics of S;r]- k,Q,9, L) are 

determined by (45). 

□ 


To obtain the asymptotic statement for N(e, m, m, n; S; Q in the Introduction, 
take the maximum of the functions in parts (a) and (c), setting the parameter L = 1 
throughout. This is an allowable choice because for all G" S Q‘^, such that Skel(G') 
properly contains Skel(G) we have L > 1. See Section for further comments on 
the relation of our results in the literature. 


4. Proof in the case of two binary random variables 

We begin by proving the theorem for the special case where V has two nodes, 
because there are fewer technical complications and the essential ideas are still 
visible in this case. The most significant reusable parts of the proof in the two- 
node case are Corollaries and The reason is that these two results give lower 
and upper bounds, respectively, on the sparsity boost in the case when the two 
variables being tested are independent, respectively dependent. This will allow us, 
in the general case, to show the following: first, that the absence of the sparsity 
boost from the objective function for a false network with an extra, false edge 
will cause that false network to quickly (in terms of number of samples) lose out 
to the true network. Second, that the presence of the extra sparsity boost in a 
false network which is missing an edge in the true network, will have its positive 
influence on the objective function of the false network quickly die out. Thus, these 
two propositions are key in showing that the sparsity boosts, collectively, quickly 
winnow down the space of networks with large objective functions to networks with 
the same skeleton as the true network. 

In the case of two-node networks, there are only two distinct Bayesian network 
structures: 
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• Go, the disconnected network on two nodes, with possible corresponding 
probability distributions Pq G Vq. 

• Gi, the fully connected structure on two nodes, with possible correspond¬ 
ing probability distributions Pi G Pe- 

The inequality we are trying to prove in this case is equivalent to 

(46) SrjiGo, GijLOn) ■= 5,,(Go, LOn) — Sri{Gi,UJN) 

We calculate that 

Sri{G(j, GijUJpf) = LL(Go, Wat) — LL(Gi, wat) 

- V'i(^)(|Go| - |Gi|)-hV' 2 (^)( 0 - max min ln/3^'’(r(p(a;|s)))) 

sGval(o) 

= -NH{p{ojn)\\p{u}n)o) - tpi{N){2 - 3) -b ip2{N){0 - In/?^” {t{p{uj)))) 
= -Nt{u}n) -b Klog(A^) - (t(ujn)) 

Note that in the two-node case, V = {A, B}, the dependence of the sparsity boost 
on the mapping Sa,b{-) in effect disappears, because the only subset oiV — {A, B} 
is the empty set. This accounts for the relative simplicity of the objective function 
in the two-node case. 

As a result of the simple form of the objective function in the two-node case, 
as reflected in the above calculation, we can decompose the function Srj into the 
composition 

N i-A t{uin) ^ 5jv,,,(r(a;Ar)), 
where -A R is defined by 

(47) ■■= -Nj + K \og{N) - log ( 7 ). 

The significance of this decomposition is that whereas the first function N i-A- t{ujn-) 
is a random variable, the second function 7 1 —>■ 5'^w(7) a deterministic function 
of the test statistic 7 := t{ujn)- 

4.1. Independent Network. Suppose that the generating network is (Go, po) 
with po G Po St product distribution with arbitrary marginals. Recall from the 
above derivation that in the two-node case 5^(Go, Gi^lum) has the following simple 
expression, 

5,,(Go,Gi, Wat) = SN,r^{T{ujN)) ■= -Nt{wn) - log,5^”('r(wAr)) + nlogN. 

The main points of the proof will be to show that with “high” probability (to be 
quantified below) 

• The mutual information t{ujn) converges to 0 fast enough that for mod¬ 
erately large N it is less than some multiplier A times rj (Proposition]^. 

• The complexity penalty — log/3^ (t(ujai)) grows linearly with JV for mod¬ 
erately large N with “slope” T > 0 depending on 77 , but not on po (Corol¬ 
lary |4| 

From these points, stated precisely and proved in Propositionsand Corollary]^ 
below, the finite sample complexity result will follow easily. 


>0 ifG = Go 
< 0 if G = Gi ■ 
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Proposition 5. Let A € (0,1) be a “multiplier” for p. Let p > 0 and po € Vo- 
Then 

{t(wjv) > At?} < VNiXi])- 
Proof. Because po e Vo, so that t{po) = 0, we have 

Pra;jv~po {r(a;Ar) > At?} = Pr^„...po {^(wa) - r{po)\ > At?} < J'a(At?). 

The latter inequality follows from Lemma □ 

In order to show the linear growth of the complexity penalty in N, it suffices 
to show the linear exponential decay in N of (ojn)- Thus we start with the 
following. 


Proposition 6. Let t? > 0, and ehoose fi G (0,1), so that t ?(1 — p.) > 0. Let 
P > 0 small enough and let N a positive integer large enough that they satisfy 

(48) 

Then we have 


(49) 


0% (^(1 - h))<e 


-NT 


Proof. Using the definition of as the CDF of r, /3^ {p{l — j])) is 

Pn (^(1 - h-)) = Pl'Yv-p') {T(yA) < T? - /TT?} 

< {^(Pa) -v\> tv} 

< Vn{tv), 

where we have used Lemmain the last inequality. So /3^ ( t ?(1 — fT)) < J-NijIT), 
and using (481 we obtain the conclusion (49). □ 

From the upper bound on /3 in Proposition]^ we readily derive a (probable) 
lower bound on the sparsity boost for the independent network. 

Corollary 3. Let T,p £ (0,1) be as in Proposition^ in particular satisfying 
condition (48). Then 


Pr 

A i o 


{-log/3 a < ivrj < Pn{v0 - t))- 

Proof. By assumption (48) and by Proposition]^ we have 

e-^^>/3^''(T?(l-/r)). 

So if t{u}n) < V0 — t) then because /3^ is increasing, we also have also 
Thus, 


Pr, 


tJN~PO 


{-log/3A < /vpj 


= Pr 


ujn^Po 


{/3a 0i^N)) 


> e 


f 


< Pifo„~po {'^(wa) > t? - t??t} 


= Pr,. 


{|t(wa) - r(po)| > T?-T?/i} 


<^n{v-VT), 


where we have used (13) to obtain the final inequality. 


□ 
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For a given N, 77 and chosen G (0,1), set 

(50) rj) = sup {r G (0,1) I TNim) < . 

By solving the inequality defining /i, 77 ) we obtain the formula, 

( 51 ) r»“(«.,,,) = ^ -log[24expOiVF(M))l _M ^ 

The reason for defining this quantity is that /i, 77 ) is the largest F such we 

can apply Corollary In terms of this new quantity, we can state Corollary in a 
form that is more convenient for the applications. 


Corollary 4. Let 77 g (0,1), and let 

(52) Fg (0,r““(iV,77,77)), 


with F““(A^, p,, 77 ) as defined in (50), so that in partieular, N,T,pL,ri satisfy (48). 
Then 


Pr, 


1 ^N~P0 


{-log/3^'’(r(a;jv)) > iVFj > 1 - ^■^( 77(1 - fi)). 


Combining Corollary with Proposition we have 
Proposition 7. Let X,yLG (0,1). Let 

(53) 0 < F < r'"“(7V,77,77). 

Suppose that the generating network is (Go,po)- Then with probability at least 

(54) 1-J-^(A77)-J-w(77(l-/7)), 


we have the lower bound. 

S.,{Go,Gi,iUN) > {T - Xr,)N + K\ogN. 
Consequently, provided that 

(55) F > A 77 , 

then we have with probability at least ( |54| ), 

Sr]iGo,Gi,UJN) > 0 . 


This concludes the particular treatment of the case G = Gq. Next we 
deal with the other case, when G = Gi- 


4.2. Dependent network. In the case when G = Gi, our goal is to show 
that, with high probability, §,,(Go, Gi, wat) < 0. 

First, by definition, we have 

S,,(Go,Gi,W at) = -log/3^’' (ujn) - Nt{ujn) + ulogN. 

From Proposition]^ we derive the following: 

Corollary 5. Let e > 77 > 0 and let A := e — ij > 0. Let 0 G (0,1) with the 

following eonditions satisfied 

~ , 24 

(56) ^^^^^(l-e) < A and iV > 24 log 

Then, 

Pr<.,;~p» {-log/3(r(a;Ar)) >log0"^} < J'Ar(A - - 0)). 
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to 0, 


Proof. Proposition [^implies that under the conditions (56), with P set equal 


Prcd - p ' {-logP{ t{ujn)) > log 0 = Pr ,^.^ p = {/3{t{ujn)) < 0 } 


□ 


Consequently we have the following estimate of §,;(Go, Gi, wat). 


Proposition 8 . Let e> rj > 0 and let A := e — rj > 0 with the conditions (56) 
satisfied. Then we have 

Sr,{Go,Gi,u}N) < log0“^ - Nt{u}n) + nlogN, 
with probability at least 1 — 

Let A S (0,1) be a “multiplier”, meaning that we are going to choose N so 
that t{ujn) > Xe, the A-proportion of its expected value, with some (close to 1) 
probability. The tradeoff in the selection of A is that the closer A is to 1, the larger 
N will have to be to achieve t{uin) > Xe with any specified probability. 

Proposition 9. With X G (0,1) a fixed multiplier, we have 


Pr. 


UlN~Pl 


{t{ujn) > Ae} < 1 - J ' Ar((l - A)e). 


Proof. 

Pr 


UlN^Pl 


> Ae} = 1 - Pr, 
< 1 - Pr 


UlN'^Pl 


OJN^Pl 


{t{u}n) < Ae} 

{|T(wAr) - e| > (1 - A)e} 


< 1-J-^((1-A)e), 

where we have used Lemma in the last line, applied with rj := e, j := Xe (so 
A = e(l-A)). □ 

By the union bound. Propositions and under the conditions (56) we have 
the following upper bound on Srj{Go, Gi,uim)'. 

(57) S',,(Go,Gi,W at) < log0“^ - A^Ae + nlogN, 
with probability at least 

1 - Tn{A - - Tn ((1 - A)e). 

Thus we obtain the following final sample complexity result for the case of G = Gi. 

Proposition 10. Let X S (0,1) be a fixed multiplier. Let e > rj and define 
A := e — rj. Let 0 G (0,1) be such that the conditions (56) are satisfied. If in 
addition 

(58) N > 

eA 


eA0- 


we have 

S,,(Go, Gi, Wat) < 0, with probability 1 — J-n{A — — J-jq ((1 — A)e). 


Proof. Dividing (57) by —k, we have S,, < 0 if and only if 

^iV-logJV- ‘°«^" >0. 

K K 


Solving this condition for N we obtain (58). 


□ 
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4.3. General Case. Putting Propositions and together, we have the 
most general form of the finite sample complexity result for 2 variables. 

Proposition 11. LetQ,X,fj.e (0,1). Let 
(59) re (A7?,r““(7V,M,r;)), 

where 

pmax 


24 


(Assume, for now, that the interval in (521 is nonempty.) Let N be a positive 
integer, such that the following conditions are satisfied, 

24 

(60) >24log- 


(61) 

(62) 


1 - 0 ’ 
.F^i(l-0) < A, 


N > —W 
eX 



Then with probability 1 — S{e, rj, A, p, 0, A, TV), we have 


St)(,Go, Gi,ujn) 


> 0 
< 0 


ifG = Go 
ifG = Gi 


Here S{e,r], X, p,Q, A, N) is defined as 

(63) max {HNiXp) + J'iv(?7(l - p)), Fn (A - + Fn ((1 - A)e)) 


Proof. Gather together all the conditions of the Propositions: (60) and (61) 
account for (56). Condition (52) is the combination of (53) and (55). Finally, 
(62) comes from (58). Thus (52) through (62) account for all the conditions in the 
Propositions. 

Since G = Go or G = Gi is an exhaustive set of conditions, the error probability 
is the max of the error probabilities in these two cases. □ 

This most general formulation of the finite sample complexity is evidently un¬ 
satisfactory as it stands for three reasons: it is not clear that the interval (52) is 
nonempty, so that there is a possible choice of P. Secondly, in order to study the 
asymptotics of TV, the number of samples, we need all the conditions on TV to be 
expressed explicitly, with “TV > ” on the left side and expressions involving all the 
variables on the right side. Third, since 5 is a choice of the experimenter, we should 
not have S expressed in terms of the other parameters, but instead we should have 
TV expressed in terms of the desired confidence 1 — (5 and any other parameters. 

As for the first task, concerning the choice of P, we have to show that there is 
a choice of X,N,p so that the interval (52) is nonempty, in other words so that 

(64) Xt] <T'^^^iN,p,T]). 


Lemma 8. The condition (64) is eguivalent to the following two conditions 
(65) F{p'q) > Xp, 

and 

log 24 


( 66 ) 


TV > 


F(pp) — Xp 
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( 67 ) 


Proof. For the interval in (521 to be nonempty, it is equivalent to have 

logs 


Xrj < -- 


N 


Fifir]). 


For a fixed choice of A, /i, there will be an N satisfying (67) if and only if (651 holds, 
because the term — can only be negative. If (65) does hold, then we can solve 
(67) to obtain ( 66 ). □ 

The final point to address is how to make the error probability 
(5(e, 77 , A,/X, 0, A, TV) less than a 6 set by the experimenter. There are many ways 
of going about this. We give just one that makes it possible to deduce the asymp¬ 
totics. Namely, we assume that S is given and we require each of the four terms in 
(5(e, 77 , A, fj., 0, A, N) to be less than |. Further, we have to bound away from zero, 
the expression A — appearing inside the J-j^ in the second term in the max¬ 

imum. In order to do this, we simply require that .F^^(0) < or equivalently, 
because is decreasing, 


Fn ( w 1 < 0- 


( 68 ) 


Because Fn is decreasing, it will suffice to replace A — ^'^^^(0) inside with its 
lower bound and require that 


Fn ( ^ 


<5 

^ 2 ' 


Thus we arrive at the five conditions stated in the following Lemma. 

Lemma 9. Let S > 0 be given. In order to achieve i5(e, 77 , A,/i, 0, A, A) < 5 it 
is sufficient to take N large enough that the following four quantities are hounded 
by 5/2 


(69) 


Fn{Xt]), Fn{v{^ - h)), ( W ) > “^Ar((I - A)e) < -, 


and in addition take N large enough so that the following holds, 

'A' 


(70) 




N 


< 0 . 


The following simple Lemma allows us to transform these conditions into ex¬ 
plicit lower bounds on N. 

Lemma 10. Let A,B > 0. Then the condition 

Fn{A) < B 

is equivalent to the condition 

Ga{N) < B. 

Solving for N, we have that both the conditions are also equivalent to the condition 
N > g^\B) = (log24-logi?)[F(Al)]-b 
Proof. Use (16) and then Lemma□ 


Applying Lemma 10 to (69) and (|70|), we have the following explicit bound for 


N, in terms of the given 6, for the error probability to be less than 5. 
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Lemma 11. Assume that N satisfies the conditions 


N > max I [F (A??)] ^, [F{r]{l — ^))] 


1-1 




T(€(i-A))r‘ 


log 


48 

T 


and 


N > 



Then S{e, 77 , A, fi, 0, A, N) < 6. 


log 


24 
■0 ■ 


Completion of Proof of Theorem [TJ In the Theorem, we have set an ar¬ 
bitrary S S (0,1). We have assumed that G = (Go,Po), Po G Vq, he. t{pq) = 0, or 
that G = {Gi,pi), Pi S Fe, that is t{pi) > e. Let p € (0,e), so that A := e —17 > 0. 
Since the conditions (26 1 and (25) in the Theorem imply that 


N > 


log 24 


F(pp) - A 77 


> 0 , 


we can define 

(71) r“-(iV,/r,r7):=-i^+F(7rr7). 


and then the interval C (0,1) is nonempty. Choose P S (Ar 7 , P™*^’^). 

Apply the most general form of the finite sample complexity, Proposition |ll[ Then 
use three lemmas in this section. Lemma Lemma |10[ and Lemma El to obtain 
the conditions given in Theorem □ 


5. Proof in the Case of n binary random variables 

The proof in the case of n nodes is, in most respects, an elaboration of the 
proof in the case of two nodes with some additional combinatorial arguments. In 
order to explain this, we introduce some notation convenient for the expressing the 
difference of objective functions, which is denoted, for G, G' S Q, by 

S^{G,G',lon) = S^{G,ujn) - S^{G',ujn)- 


Usually, we will think of the network G in the first position of the difference as the 
“true generating network” so that we are trying to show the (probable) positivity of 
§>^{G, G',ojn)- By A(G, G') we denote the symmetric difference of the Skel(G) and 
Skel(G'). Further, we define a sign associated to any edge (A, B) in this symmetric 
difference by 


ai{A,B),G,G') 


-bl for (A, 5) eG'-G 
-1 for (A,5) SG-G', 


(which can simply be thought of as 0 for (A, B) ^ A(G, G')). We then define the 
set 


Sa.b{G,G') 


Sa,b{G) for (A, F) g G'- G 
Sa,b{G') for(A,F) e G - G'. 
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Then in this notation, we have 


(72) G',iJN) = LL(a;jv, G) - LL(w^, G') + MN) {\G'\ - |G|) + 


MN) 

MN) 


a({A, B),G,G') max min —In t{p{ojn,A,B\s)) 

^ SgSa b{G,G') sGval(S) [ JV \ \ 

{A,B)eA{G,G') • ^ ^ ' 


E 


(A,B)^E{G)yjE{G'\ 


max — max 


\seSA,B(G) seSA,B(G')/ \seva.i(S) 


-In 


/3^t(p(ujjv,A,BIs)) 


Corollaries and [^continue to provide the probabilistic information we need in 
order to estimate the second line of this expression, that is the part of 5,,^Ar(G, G') 
coming from the sparsity boosts. In contrast, the situation with the first line of 
this expression, in particular, the difference of log-likelihood terms, is somewhat 
different in the n-node case. Recall that in the two-node case, the difference of the 
log-likelihood terms is known to be precisely —N times the test statistic 7 , while 
in the n-node case, it is clear we cannot hope for such a precise formula, and we 
must find estimates instead. Fortunately, because of previous work on the MDL 
objective function, many such estimates are known, at least under various auxiliary 
assumptions. The auxiliary assumption we will adopt is the one followed by Hoffgen 
in [H6f93| . namely, that each node has at most d parents, and we will follow his 
methods to derive more precise versions of the estimates appearing in Hoffgen’s 
paper. 


5.1. Comparison between Empirical and Idealized Log-Likelihood. 

We fix d < n, the maximum in-degree, that is maximum number of parents of a 
node in G. We let Q = denote the space of Bayesian network structures with 
bounded in-degree d. The significance of the restriction on the space of possible 
structures to Q‘^ is as follows: if we consider all the log-likelihoods for G S G'^, 
there are super-exponentially many values to be estimated, and a simple-minded 
application of bounds for individual log-likelihoods followed by the union bound will 
give very poor simultaneous bounds, leading to results for the sample complexity 
which are exponential in n. There is a well-known decomposition of the LL(a;Ar, G), 
for any Bayesian network G, in terms of conditional entropy terms: 


(73) LL{ujn,G) = -NY,Hp^,iX^\P&G{X^)), 

i=l 


where is the parent set of node W in the DAG G. We can use (73) to 

show that a polynomial number of distinct terms appear in all the LL(a;jv,G) for 
G G G‘^, namely at most. 


(74) 


^d+i - \-n+l 


- 1 
n — 1 


which is only 0(n'^+^). If we have an estimate which applies to all of these 0(n‘^+^) 
terms individually, then applying the union bound will give a related estimate of 
all the log-likelihood terms simultaneously. This is the key to achieving a finite 
sample complexity result with polynomial dependence of N on n. We stress that 
the idea we have just sketched out is due to Hoffgen in |Hof93| . We begin proving 
an effective version of the estimates in |Hof93i . 
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Lemma 12. Consider the family of Bayesian networks on n fixed vertices 
with bounded in-degree d. Consider the total number M of different expressions 
pN^ogpN, Here, pN is an empirical estimate in the joint contingency table of all 
the variables, appearing in all the expansions LL(w 7 v,G) (as given in ([7 ^ ). Then 

M<2"‘ 

\\1/ \d+l/y + 1/n - (2(i + 1) 

Proof. When y, Z are distinct subsets of the variable set X, and P is a (joint) 
probability distribution over X, we use 


(75) 


Hp{z\y) := Hp{z,y) - Hp{y), 


where Hp{Z,y) and Hp{y) are the (ordinary) entropies of the probability distri¬ 
butions on the subsets Z uy and y, induced by P. Each of the relative entropies 
appearing in (731 can, by (751, be expressed in terms of ordinary entropies. For 
example, in the case that Va-cixi) = {a;j(i), ■ • • (the largest possible parent 

set under our assumptions) and when P = P, an empirical distribution 


(76) Hp{xPVa.G{xi)) := i7p(a:j,...,a;^(d)) - Hp{xj(^ip ... ,Xj(^a))- 

The number of entropy terms of the form Hp{xi,Xj(i),...,Xj(k)) is bounded by 
k can range from 1 to d. Each one of the entropy terms is the sum of at 
most 2'^"’'^ terms of the form p\ogp where p is an empirical estimate of p, derived 
from N independent samples from a Bernoulli random variable X(p) of fixed but 
unknown parameter p. So there are at most a total of M such terms pN log pat 
appearing in all the expansions of log-likelihoods, with M as defined in the Lemma. 
The bound on the sum of the first d -I- 1 binomial coefficients is elementary and is 
stated and proved at |LuglO| . □ 


By computing all of the conditional entropies in the expansion (73) with respect 


to an arbitrary probability distribution B in place of the empirical distribution 
PuiN j ’"^6 obtain a generalization of the log-likelihood commonly called the ideal¬ 
ized log-likelihood LL^^(P,G). (For more context, see Appendix | b[) Considering 


LL^^(P, G), we have a completely analogous statement to Lemma |12| concerning 
the number of terms of the form plogp, where p is the conditional probability 
of an atomic event obtained by appropriately conditioning and marginalizing P. 
Since terms plogp in the LL^^(P, G) correspond one-to-one with terms plog(p) in 
LL(a;Ar, G), this statement has the same proof. 

We are able to prove the proposition that represents the culmination of this 
section, concerning the (empirical) log-likelihood as an “estimate” of the idealized 
log-likelihood corresponding to the underlying distribution: 

Proposition 12. Let G G Q'^. Let e, d > 0. Let N > Nn{e,5), defined by 




Nn{e, d) = max 
Then we have 


3(1 -I- log2)^4‘*+^n^ 


.24.12exp(w(^)) 


•log 


6{n — {2d -\- 1)) 


Pr, 




||LL(a;w,G)-LL(^)(P,G) 


< fV • e, VG G 




> 1 - 6 . 
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Proof. Each conditional entropy in the expression (164) of LL^^^(P, G) (re¬ 
spectively in the expression (165) of LL(a;Ar, G)) has n conditional entropies. Each 
of the n conditional entropies may be rewritten, by (76), as the difference of two 
ordinary entropies, as we have written out above in (76). Each of the ordinary 
entropies has at most 2^^+^ terms terms of the form p\ogp for p the probability 
of some event (respectively, of the form pArlogpjv for pjv the estimated probabil¬ 
ity of the event). Thus, there are a total of at most n ■ 2 ■ 2'^+^ = n2‘^+^ terms 
of the form p\ogp for p the probability of some event (respectively, of the form 
Pn logpAT for Pm the estimated probability of the event) in the sum for (P, G) 
(resp. LL(tLiAr,G)). In order to achieve accuracy e in the estimate LL(a;Af,G) of 
(P, G) it will suffice to have accuracy e' := ^. 2 d +2 in the estimate of each 
quantity plogp in Proposition!^ 

In order to have all the estimates with confidence 1 — 5, we must have each 
individually with confidence 1 — 5/M, where M is the number of individual esti¬ 
mates. Lemma gives us an upper bound on the number of individual estimates 
M. So we must take 5' = 5/M in the function Nn{e, 5) of Proposition]^ 


We obtain the function in the Proposition by substituting e' := 

5' = 5/M {M from Lemma 12) into the function N of Proposition]^ 


2d + 2 


for e and 
□ 


5.2. Log-likelihood term estimates. For P any probability distribution 
and Q any family of BN structures (DAGs), and any ^ > 0 we make the definitions 

• 13{P) is the set of Bayesian networks of the form {C,pc',p) for G' ranging 
over Q. 

• is the subset of B{P) consisting of those {G',pa',p) for which ei¬ 
ther H{P\\pc',p) > C or for which pg>,p = P (so that H{P\\pg',p) = 0). 

The motivation behind introducing this notation is that in the setup for the learning 
algorithm, we are given a distribution P (unknown to the experimenter), for which 
G is a perfect map (the “faithfulness assumption”). Conceptually (although this is 
not true in practice) we may imagine that the learning algorithm consists of two 
stages, a structure-learning part that returns a DAG G', followed by a parameter 
learning algorithm returns P', maximizing the log-likelihood of the data, LL{uj, G'). 
We wish to be able to refer concisely to all the BNs (structure-distribution pairs 
(G', P') such that G' is an I-map of P'), that are eligible to be learned by our entire 
learning algorithm, assuming that the structure learning part may make an error, 
but that the parameter-learning part of the algorithm is a perfect expectation- 
maximizer (it can look at infinite data and is completely accurate as a maximizer 
of LL(a;Ar, G')). The KL-divergence H{P\\pg'^p) is an appropriate measure of the 
distance of an element G' £ Q from {G,P). Consequently, a positive H{P\\pg',p) 
may be considered a measure of the error of an algorithm which returns G' instead 
of G. The set of structures (G',P'), for which P' is “closest” to P, given the 
constraint that P' factors according to G', is precisely B{P). 

The precise statement we would like to prove in a consistency proof of any score- 
based structure learning algorithm is that the score, considered as a function on all 
of B{P), achieves its maximum at the true network (G, P). Generally, in order to 
obtain a finite sample complexity result, we cannot achieve this and instead have 
to content ourselves with a statement that G maximizes the score on the subset of 
B{P) exeluding those {G',P') ^ {G,P) which are (/-close to {G,P). We denote by 
Bq{P) the subset of B{P) which excludes the structures in B{P) which are ^-close 
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to—but not equal to—(G, P). So can be thought of as B{P), with the ^-ball 

around (G, P) excised, but with the point (G, P) itself added back in. Then the 
consistency results we can prove will say that the score, considered as a function 
restricted to Bq{P)^ achieves its maximum at the true network (G,P). 

We recall the following relationship (e.g. p. 17 of [FYQGj l between the em¬ 
pirical log-likelihood of the data ojn under an arbitrary network G' G G'^ and the 
KL-divergence of P{ujn) based at pg',i^n'- 

(77) LL(a;jv,G0 = NH{P{^n)) - NH{p{u:n)\\pg' 


We also have the corresponding relationship for the idealized log-likelihood of any 
distribution P: 


(78) 


(P, G') = NH{P) - NH{P\\pg-,p). 


It is clear that the condition that P happens to factorize according to G' is 
equivalent to pg',p = P, and also to NP[{P\\pg',p) = 0 , so that by (78), the 
G' which are Tmaps for P are simultaneously the set of distributions for which 
NH{P\\p.^p) takes its minimum value of 0 and for which LL^^(P, •) takes its maxi¬ 
mum value of NH{P). Similar statements follow from ([77]) in relation to LL(a;Ar, •) 
and NH{p{u}n)\\p-,ujiv)- 

The expressions ([TTf and ( |7^ , together with the estimate of Proposition 
allow us to say (with confidence 1 — <5) that, scoring only on the log-likelihood, the 
correct structure (G, P) loses to any competing structure in B{P) by only a certain 
margin, and furthermore, actually beats any competing structure in Bc^{P) by a 
certain positive margin (both margins are actually linear in N). 


Proposition 13. Let e,SX > 0 be fixed, and let d be a fixed in-degree. Let 
G G G‘^ be a fixed Bayesian network. Let Wi(e, 5) denote the function of Proposition 

(a) We have, for all G' e , 

Px^^^G {LL(wjv, G) - hhicoN, G') > -Ne} >1-6, 


for 


N>Nn 



(b) We have, for all G' G G'^, {G',P') G Bq{P) and G G', which is to say, 

for the set of Bayesian networks [G',PG' ,p) G ; excluding only those 
networks such that H{P\\pg\p) < C, 


PW-G 


■|LL(a;7v, G) — LL(a;jv, G^) > 



>1 — 6 , 


for 


N>Nn 



Proof. The idea is to estimate the difference between the empirical log-likelihoods 
by the difference of the idealized counterparts. With probability 1 — 5, we have by 
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Proposition [T^ 

LL{ujn,G)-LL{ujn,G') = LL^^\P,G)-LL^^\P,G') 

+ LL(a;Ar, G) — (P, G) 

+ LL^j^\p,G')-LL{ojn,G') 
> NH{P) - NHiP\\pG,p) 
-NH{P) + NH{P\\pg',p) 


JGV, 

N>N^\ 

(- -I 
'.2’ 2/ 

jfA, 

N>Nn- 

(if) 


Regarding the expression appearing on four lines after the last inequality: the 
quantity NH{P) in the first line cancels with its opposite —NH{P) in the second 
line. As discussed in the paragraph preceding the proposition, the function G' i—>■ 
NH{P\\pg',p) takes its minimum value of 0 at G' = G, and so is positive at G' S 
Thus, 


LL(u;Ar, G) — LL(wAr, G') > • H{P\\pg',p) ~ ^ 2C 


e, iV>7V„(f,f) 


iV>7V„(§,|) 


> N ■ 


-e, 


iG\P')eB{P), N>N^{I,I) \ 
C-f, {G',P')eBc{P), N>N^[l,l))- 

The key aspect of Proposition which allows us to make these estimates is the 
quantification over all G' S G'^: inside the estimate of Proposition 12 □ 


In Proposition we have summed up all we have to know about the log- 
likelihood terms in order to proceed with the proof of the finite sample complexity 
in the n-node case. 

5.3. Competing Network is ^-Distant from Generating Network. When 
the competing network G' ^ G belongs to Bi^{P), we can use part (b) of Proposition 
[Tslto show that the difference of log-likelihood terms (72) is positive. As a result, 


we will be able to show that the difference function Srj^NiG, G') is (with high prob¬ 
ability) eventually positive, without using the linear growth of any of the sparsity 
boost terms from {A,B) G G'\G. Note that for G\G' C A(G, G'), tT((A, R), G, G') 
is always —1 or 0, and Sa,b{G,G') = Sa,b{G'). Thus, ignoring the terms in the 
sum over A(G, G') from G'\G (which is always positive, since cr((A, B), G, G') = 1 
on G'\G) we have 

(79) 5^,iv(G, G') > LL(w^, G) - LL{ujn, G') + i/-i(iV) (|G'| - |G|) 


-MN) 


E 


{A,B)eG\G‘ 


+MN) 


E 

(A,B)^E(G)UE(G') 


max min — In 
SeSA,B{G') sGval(S) 


max — max 
seSA,B(G) seSA,B{G') 


PN'^iPi^N,A,B\s)) 


min — In 

sGval(5) 


I3^t{p{ujn,A,B\s)) 


We now explain why we have to use part (b) of Proposition 13 on the log-likelihood 
terms in order to bound this difference from below: the other terms, namely the 
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difference of complexity penalties and the opposites of the sparsity boosts, are both 
negative. In order to get general lower bounds on these negative terms, we have to 
consider the case of G' the empty network on n nodes, in which case \G'\ = n < \G\, 
and also there are \E{G) \ terms in the sum of opposites of sparsity boosts. Although, 
with the choice logarithmic and ijj 2 {N) constant, these terms are at most 

roughly logarithmic in N, we still need positivity of the difference of log-likelihoods 
to achieve positivity of the entire difference. 

Henceforth, we will make the following assumptions in the proof: 

(a) We have G is a perfect map for P, G' ^ G, and {G',pc',p) G 
equivalently, G' satisfies H{P,pqi^p) > (^. 

(b) The separating set § satisfies 


maxcrfG) < d, 

G&g 

meaning that for all G S ^ and for all {A,B) G V^^\A, and for all 

SeSA,B(G), |5|<d. 

(c) The separating set § is bounded by a polynomial in n of degree d, i.e., 


(80) 


< 


The condition (80) means that 


For all (A,B) G 


U ^Ab(G) 

Gea 


< 



(d) We are using the following choices of weighting functions: 
tpi{N) := KlogN, '4’2{N) := 1, k a constant (e.g., k = \). 

(e) For (A, B) i E{G) U E{G'), we have Sa,b{G') C Sa,b{G). 

We now incorporate these assumptions to estimate Srj,NiG,G'). Assumption (e) 
implies that the third line in (79), namely the sum over (A, B) ^ E{G) U E{G'), is 
positive. The reason is that, when Assumption (e) is in force, the maximum of the 
sparsity boosts over the larger separating set Sa,b{G), is at least as large as the 
maximum of the sparsity boosts over the possibly smaller subset Sa,b{G'). For the 
purposes of bounding Sjj^n{G,G') from below, we can therefore ignore the third 
line of (79). Now, applying part (b) of Proposition 13 to the log-likelihood part 
we obtain that, for each > 0, we have, with probability at least 1 — (5i, fo r al l 


iV > A where this function is as defined as in part (b) of Proposition 


13 


(81) 


5^,n(G,G')> ^-NlogA(IGI-n) 

— |i?(G)| max min —In 

(A,B)£G sGval(S) 
seSA.BiC) 


l3N'^{p{t^N,A,B\s)) 


We stress that the quantities |G| and |A(G)| are in general different: the former 
is the number of parameters in a probabilistic model for which G is a P-map, 
exponential in the degree, the latter the cardinality of the edge set of (the skeleton 
of) G. 

Now we would like to apply Corollary to bound each sparsity boost 
— In /3^ Tijpiiojsi, A, i?|s)) from above by a universal constant In 0“^, but the prob¬ 
lem with applying the proposition directly is there is data fragmentation. Of the 
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N points in the sample only N/m (in the worst case) are expected to contribute 
to the test statistic A, B\s)). Lemma 13 below, quantifies by how much 

we have to increase N to overcome this problem. Before stating and proving that 
lemma, we introduce some notation to help us keep track of such things. 


Definition 4. Let N' > N > 0 be positive ingers. Let S C V and s G Val(5'). 
Let u>]v'ls=s denote the subsequence ofujN' consisting of those points which take the 
joint values s in the variable S. When the number of such points in the subsequence, 
card(wjv'|s=s); equals N, we will sometimes use the notation 

Yn := ujn'\s=s- 

Note that this notation emphasizes the number of points N in the subsequence, 
at the cost of suppressing the conditioning assignment s from the notation. 


As a result of the comment in the definition, the notation Y/v is used, for the 
sake of readability, in those portions of the argument where s does not matter. 


Lemma 13. Let 9 > 0. 

(a) For a fixed S G S'a,b(G) and s G Val(5'), let 

m = mp{s) := [p(S' = s)] ^. 

Then, 

{card (w™iv|s=«) < (1 - 0)Af} < 

(b) For a fixed edge {A,B) G V^^\A, in particular for {A,B) G G, let 

m = mp{{A, B),G,S) := max [p(S'= s)] 

ses^ b(<3) 

sGval(5) 

Then, 

\ min card (a;™^|s=,) < (1 - 0)1V I < \s\^-Nm6Gfi 

y sGval(5) J 

(c) Recall the notations Sa,b{G), crsiG) and and m{G,G,S) from 

Section [H In particular, set 

(82) m = mp{G,G,S) := max mp{{A, B),G,S), 

where 

mp{{A, B),G,S) := max [p(5'= s)]~^. 

ses^,B(S) 

sGval(S) 

Then 

PW»^P< min min card {uj^n\s=s) < (1 - 9)n\ < \E{G)\-i:siGy2'^^^^e-^^^"/\ 
{A,B)eG s£Sji^B(e) 

{ sGval(S) J 

Proof. Part (a) is a straightforward application of the Multiplicative Chernoff 
bounds, second part. Part (b) follows from the observation that the value set 
Val(S') for S G Sa,b{G) can have at most joint values, and from 

the union bound applied to the estimate in Part (a). Part (c) similarly follows 
(a) and from the union bound, the fact that as{G) (resp., provides an 
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upper bound for b{G') |5'| (resp., |5'yi_B(G'')|) where G' ranges over Q and 

iA,B)GE{G). ’ ’ □ 

Corollary 6. In accordance with Definition 0 if N' > m(l — 9) , then 

with probability at least 

(83) 1 - \E{G)\■ ^siG) ■ 

for all [A, B) S G, for all S S Sa,b{G) and for all s € Val(S'), 

WAr'|s=s can be written asYjy, 

where Ypf is a sequence of observations, all of which take the joint value s at S G V. 


Definition 5. Fix p S V. For each {A, B) G G, and for each S G SA,BiG), let 

s = Sp{{A,B),S) = argmax3gvai(S)'r(p(A5|s))- 

The hypothesis on minimal edge strength in G says precisely that for all 
iA,B) e G and for all S G Sa,b{G), we have 

t{p{A,B\s)) > e. 

Because Corollary is quantified over all {A,B) G G, over all S G Sa,b{G) and 
over all s G Val(5'), we have that if N' > m{l — 9)~^N, then with probability (83), 

card(a;Ar'|s=j) > N, 

for all {A,B) G G, S G Sa.bIG) ■ That is to say with probability (83) that as 
{A,B) varies over E{G), all of the 

WAT'|s=sp((A._B),s) can be written, simultaneously, in the form Y^. 

Now we are able to bound from above the sum of sparsity boosts. 

Lemma 14. Let N', m, 9 be related by 

N' = m{l - 9)-^N. 

Let 

24 

(84) N' > to(1 - 9)~^ (A)]~^ 1 _ 0 • 

With probability at least 

1 - |L;(G)| • YsiG) ■ 2-(e)g-iVmeV3 _ |i;(G)|E 5 (e).F^ ( 
we have 


A - 


N 


( 1-0 


(85) 


max min — In 

(A,B)€G sGval(S) 
seSA.BiC) 


/3^,t{p{ujn',AB\s)) <log0' 


Proof. We prove the bound for arbitrary {A, B) G G, arbitrary S G Sa ,b{G), 
and for the one value s G Val(S'), where s := s((A, B), S), at which the maximum 
separation > e from independence is achieved. This will suffice because of pattern 
of maxes and mins on the left-hand side of (85), the quantity to be estimated. By 
the comments between Definition and Lemma 14 we have for each [A, B) G G, 
S G SA,BiG) and s := s((A, B),S), for N' > m(l - 9)-^N, 


-In 




= - In 


filT{piY^,A,B)) 


for Yat a sequence of observations of length N, with probability (83). 
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Now assume that N satisfies (84). By Corollary and the union bound, for 
fixed {A,B) e G, the sparsity boosts is less than or equal to 


-In 


pP^ripiYN^AB)) 


< log© 


-1 


with probability 




One further use of the union bound completes the proof of the statement over all 
{A,B)gE{G). □ 

We can now estimate the entire difference of objective functions. 

Proposition 14. For N' = m{l - 9)~^N, 


( 86 ) 


N' > max 


to(1 — 9) ^ [^(A)] ^ log 


24 


'I I 


1-0 

we have with probability at least 

(87) 1-Si-\E{G)\- EsiG) ■ - |S(G)|S5(e)^iv ( 

the lower bound 

(88) 5c(G,G',a;Ar-) = ^-«:logA'(|G|-n)-|i?(G)|log0-i. 

Thus, if we assume, in addition, that 

(89) 


A-J-^'(l-0 


/ |£^(G)I 


■) 


c 


\^3k(|G| — n) 


then we have with probability at least 

S^iG,G',uj'j^)>0. 


Proof. The first statement follows from Lemma 14 applied to (81). The 
second statement follows from solving (88) for N'. □ 

Proposition gives the most general form of the finite sample complexity. 
The only thing that remains is to explain how to choose the number of samples N' 
so that (87) is at least 1 — for <5 > 0 an arbitrary error probability set by the 
experimenter. The following gives one way (of many possible ways) of dealing with 
the most complex term in the error probability. 


Lemma 15. Let P, A, 5' > 0. Suppose that N is an integer satisfying 
(90) iV>maxf f log^, -F" f log 

Then < 


24 

¥ 


A, and 


FN{A-Ff,HT)) <S' 


Proof. By Lemmaj^ 

N > (log 24 — logP) 


1 -1 
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is equivalent to N > 0A^(r), which is equivalent, by the definition of G to 

2 

Then the quantity inside Tn is at least y. Further, by the same 
reasoning, (y) < S' is equivalent to 


N > (log 24 — log S') 


F{ ^ 


So, if (84) is satisfied then, because Fn and Fn are decreasing 
FN{A-F]^\r))<FN(^] <6'. 


□ 


Lemma 16. Let S > 0 be given. Set (5i = |. Also, suppose that N' satisfies 


(86), so that in particular 


iV'>iV„(|4V 


Also suppose that N is an integer and such that N, N', satisfy N' = Nm{l — 9) ^, 
and such that 

, 3 3|F;(G)|S5(g)2-(^) 

to6»2 ® 5 


Finally, suppose that N is greater than the quantity on the right side of (90) with 
r set at 1 — Q and S' set at ■^\e{g)\'Ss(Q) ’ 


N > max 


FI ^ 


1 -1 


log 


24 

1 - 0 ’ 


F{ ^ 




s I ■ 

Then the quantity in (87) given as a function of N' (and N) is at least 1 — 5. 

This completes the proof of the finite sample complexity result in 
the case when the competing network is ((-distant from the generating 
network. 


5.4. Competing network is not Contained in Generating Network. 

Now, assume that Skel(G') ^ Skel(G). In this case, there are not only terms (with 
negative sign) corresponding to edges {A, B) S G\G' but also terms with positive 
sign corresponding to edges {A,B) € G'\G: 


5„.jv(G, G') = G) - LL{ujn, G') + MN) (|G'| - |G|) 


-MN) 

+ '02 (N) 


E 


{A,B)eG\G‘ 


max min — In 

SeSA,B{G') sGval(S) 


E 


{A,B)eG'\G 


+MN) 


E 

(A,B)^E(G)UE(G') 


max min — In 

SeSA.BiG) sGval(S) 


max — max 


ISN'''iPi^N,A,B\s)) 


Pn^(.p{^n,A,B\s)) 


SeSA,B(G) S^Sa,b(G')J \sGval(S) 


-In 


Pnt{p{u:n,A,B\s)) 


The strategy will be different because we will prove (probable) asymptotically linear 
growth for the third line. As a result, we will need only to hound the first two 
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lines, rather than obtain a positivity result for them. In particular, it will not be 


necessary to use part (b) of Proposition 13 and we will instead use part (a), with 


no assumption that G' is ^-separated from G in KL-divergence. 


We continue to assume that conditions (a)-(e) from Section 5.3 are still in force. 


We can use the same result, Lemma [14| as in Section 5^ to bound the negated 
sum of sparsity boosts, because this is exactly the same sum as appeared in 
Sn{G,G 'for the case of Ske^G') C Skel(G). Thus, by straightforward ap¬ 


plications of Proposition [T^a) and Lemma 14 we get the following counterpart to 
Proposition 

Proposition 15. Let e,Si > 0. Let 0 S (0,1). Let N' be a positive integer 
satisfying 


N' > max 


m(l — 9) [^"(A)] log 


24 

1-0 


N I - — 


e 

2 ’ 2 


where is the function of Proposition [H[ We have with probability at least (87) 
that 

(91) S^{G,G',ujn') > -lV'e-KlogiV'(|G| -n)- |L;(G)|log0-i-f 


MN') 


E 


max min — In 
. SgSa,b(G) sGval(S) 




_(A,B)eG'\G 

We now begin to study, and bound from below, the second line in this lower 


bound (91) for Srj{G,G' For {A, B) G and S G SA,BiG), recall the 

notation 

m„(S) = max mAs) = max \p{S = s)l ^. 

sGVal(S) sGVal(S) 

Definition 6. Let (A^B) G G'\G. Because {A,B) ^ E{G) andS is a separat¬ 
ing collection for G in Q, by Definition^ there are some (at least one and possibly 
more than one) S G Sa,b{G) with the property that A _LL B\S. Define 

(92) S SpiiA,B),G) G <S'a,b(G) such that A _LL i?|5 andmp{S) minimal. 

Note that there are potentially several S G Sa,b{G) such that A _LL B\S, and 
the mp{S) are all equal and minimal for S G Sa,b(,G) with the property that 
A AL B\S. But because there are only at most |S'^,b(G)| < ^^(G) such S, there 
are only finitely many choices of S. In order to make the notation concise but tech¬ 
nically correct, we assume a fixed but arbitrary choice of S for each (A, B) G G'\G 


which satisfies (92): the choice of S has no effect on the argument, so long as one 


choice is fixed throughout the argument. 

Since our purpose is to estimate from below the maximum of the sparsity boosts 


over S G SA.BiG), we may replace the entire maximum over S'a,b(G) in (91) with 
the expressions being maximized evaluated at any particular S', for example, at 
S = S. 

Proposition 16. Let C be a subset of G'\G of cardinality L. Let 

rh ■.= rhJG. S)'.= max mJSJ(A,B),G)). 

(A.B)Gyx2\G ^ ^ 


( 93 ) 
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Denote mp{G,S) more concisely by rh. Let N' and N be positive integers such that 
N' > {1- e)-^rhN. Let fi G (0,1). Let T satisfy T G (0,r““(lV, 77 )). With 
probability at least 

1 - - asiG)LTNiv{l - m )), 

we have for all {A, B) G C, with S = S{{A, B), G) defined as above 


min — In 
sGVal(S) 


Pn' iTiPc.j^,,AB\s)) 


> NT. 


Consequently, 




y max min — In 

Pn't{p(ion',A,B\s)) 




> LNT. 


The main step in the proof is to prove a version of Lemma 13 ’c) which is 
appropriate for the case at hand. 


Lemma 17. Let C c {{A,B) G G’\G}, L ■.= 
Then 


Card(£) as above, m as in (931. 


Pr I min min Card (a;,f, 7 v| c_ J <(1-0)n\ < 

\iA,B)eCseV3,\{S{iA,B),G)) ^ ^ j 

where 

Tss{G) := max Sa,b{G). 
(A,B)gG 


Proof. We begin with Lemma nib), in particular the version with 
{A,B) G P^^\A. Apply Lemma [Tib) to all the {A,B) G C C simul¬ 

taneously. The event whose probability we are estimating in the present Lemma 
differs from the event whose probability we have estimated in Lemma [li e) in only 
two ways: first we have {A, B) ranging over the set C, instead of E{G). Second, S 
belongs to Sa,b{G), which is a subset of the larger set Sa.b{G)- The first difference 
accounts for the appearance of L instead of |if(G')| in the estimate. The second 
difference accounts for the appearance of S 5 (G), in place of T,s{Q) in the error 
estimate. □ 


Completion of Proof of Proposition [TBl Take A' > (1 - 9)~^rhN. Let 
ujn' ~ P with Card(a;7V') = N' 


By Lemma 17 with probability at least 

1 - 

we have that for all s G Val(,S), 

= bjv, Card(yAr) = N, 

for Ya! a sequence of observations meeting the criteria that the observations take 
the joint value of s at the variable-set S. 

Then we can apply Corollary [^ to each one of the terms in the sum. Us¬ 
ing Corollary to estimate and the union bound, we conclude that for ~ P, 
Card(a; 7 v/) = A', we have, by Lemma 17 with probability at least 

1 - LE5(G)2'"(^)e-^™®'/^ 
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that = In, Card(YAr) = N. For 


re (o,r“"’^(iv,/i,r?))= 0 


-logs 

N 


F{nr])) 


we have 


S min - ln/3^ (T(a;Ar', S|s)) > TVF ^ > 1 - Ivans’)] J'Ar(?7(l “ m)) 

[sGVal(S) J 

> 1 - asiG)FNivi^ - m ))- 

To complete the proof, we need a bound that holds for all {A, B) G C simultane¬ 
ously, so we apply the union bound one more time, at the cost of one more factor 
of L appearing with the as{G)Fi\[{r]{l — fj,)) term in the error probability. □ 

Combining Propositions [Ts] and [Thl we obtain 


Proposition 17. Let € (0,1), m as in (93). Let N' be a positive integer 
satisfying 


N' > max 


m(l-6i)-i[F(A)]”^log 


24 




1 - 0 ’ " V 2 ’ 2 

and let N be a smaller positive integer and 9 e (0,1) such that 

(94) N' = to(1 - ey^N. 

Let T e (0, r™®'’'(fV, p., 77 )). With probability at least 

(95) 

I-Si- \E{G)\■ EsiG) ■ - |£;(G)|Ss(0)^n (a - - 0)) 

- S5(G)L2‘^(®)e-'^™®'/3 _ asiG)LFNir]{l - y, 

we have 


(96) S^iG,G',ujy > 

In this case, if we also have 


/Lr(l - 0 ) 


-e A' - 6;(|G| -n)logA'- |F;(G)|log0-h 


r > 


L (1 - 0) 


then Sri{G, G',lun') > 0 for all suffieiently large N', in particular, for 




E(G)I 


(Lr(l -0)- me) 0 (IG|-M« 


mK{\G\ — n) j 

Proof. The only relation that needs further explanation is where the coeffi¬ 
cient in (96) comes from. From Propositions 15 and 16 we have the following two 
linear terms in the bound 

S^{G,G',ujN') > LTN + eN' + ■■■ 


where N and N' are as in (94). Then because of (94), 

/Lr(l-6») 


Sr,{G,G',WN') > 


\ m 


-e A + 


as claimed. 


□ 
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This most general formulation of the finite sample complexity is evidently un¬ 
satisfactory as it stands for three reasons: it is not clear under what conditions we 
have the non emptiness of the interval, 


so that there is a choice of T satisfying the hypotheses. Secondly, in order to study 
the asymptotics of N', the number of samples, we need all the conditions on N' 
to be expressed explicitly, with ‘W' > ” on one side and expressions involving all 
the variables on the other side. Third, since the error probability is a choice of the 
experimenter, we should not have S expressed in terms of the other parameters, but 
instead we should have N' expressed in terms of the desired confidence 1 — 5 and 
any other parameters. 

As for the first task, concerning the choice of T, we have to show that there is 
a choice of A, N, n so that the interval of possible choices of T is nonempty, in other 
words so that 


(98) 


L{l-6) 




(99) 
and 

( 100 ) 


Lemma 18. The condition (98) is equivalent to the following two conditions 

cm 


Fim) > 


N > 


L{i-ey 

log 24 


Fim) - 


( 101 ) 


Proof. For the interval in (521 to be nonempty, it is equivalent to have 

log 24 


em 


L{i-e) 


< - 


N 


+ F{fj.r]). 


For a fixed choice of A,/r, there will be an N satisfying (101) if and only if (99) 
holds, because the term — must be negative. If (99) does hold, then we can 
solve (llOll) to obtain (llOOl). □ 


( 102 ) 


We can ensure that (99) is satisfied by taking 

(1 - 0)LF(H 


4m 


Then the condition (1001 is satisfied as long as 

4 log 24 


(103) 


N > 


^Fipif)' 


The hnal task, namely, to write down explicit conditions on N' so that the confi¬ 


dence (95) is at least 1 — 5, for given 5 > 0, can be accomplished in a number of 
ways. We choose a very simple way, which is still sufficient to extract the asymp¬ 


totic dependence of N' on 5, namely, we set each of the 5 subtracted terms in (95) 


equal to |. The following Lemma sums up the explicit conditions on N which we 
obtain thereby. 
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Lemma 19. Let 6 > 0 be given. Let /i, 0 S (0,1) be parameters as above. Let 
m be as in (Qi). Then if we take 

6 


(104) 


= 


5’ 


and N larger than all of the following guantities: 
(105) 


3 5|L;(G)|S5(0)2-(<^) 

log j. : 


(106) 


(107) 


(108) 


max 


F -r 


mO^ 

-I -1 


log 


24 

1 - 0 ’ 






I ? 


3 , 5LS5(G)2'^(G) 

log-j-: 


log 


/120(75 (G)L 


[-P(?7(l - m))] 


-1 


Then (95) is at least 1 — <5. 


Proof. Each of the conditions (104)-(108l corresponds to one of the sub¬ 
tracted terms in (95). The derivation of (105) and (107) requires no further expla¬ 
nation, as they just result from setting the second and fourth terms equal to | and 
solving for N'. In order to obtain ( 106| ), we set the third term 

\E{G)\J:s{g)^N (a - J-^1(1 - 0)) < 


and apply Lemma [Ts] with 


(5' = 


5|l;(g)|E5(0)- 


For the final condition, (108), set the final term of (95) equal to S/5, obtaining 

5 


which is equivalent to 


FNivil- fi)) < 




5(75 (G)L’ 


□ 


5asiG)L' 

Now use Lemma to obtain that this condition is equivalent to (108). 

Completion of Proof of Theorem [5l For Part (a) start with condition 
(891 and the conditions in Lemma 16 changing variables from N' to N throughout 
using the relation N' = N{1 — 6)~\ 

In order to obtain the conditions in Part (b), start with the conditions in 
Proposition 17 Use the choice of e given in (102) to obtain the form of the condition 
(971 given in (39). Since 


N > 


by (0, 


2 log 24 
F{m) 


r““(iV,Ai,ry) > 


F{m) 


2 
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Further, with the choice of e in we also have 


em 


Lil-0) 


^ F{m) 


So we may take 


r = 


Fim) 


Thus 


LFd - 0) - m. > (1 - 0) - (!-»)= “- , 


Thus ( p^ is bounded by (39). Finally, we obtain the remaining conditions by 
taking N larger than the quantities in Lemma [T^ In order to obtain the theorem 
in the form stated, we change the notation for the number of samples back from 
N' to N. □ 


6. Refinements nsing Sanov’s Theorem 

6.1. Case of Two Nodes. We are improving the bound under the assump¬ 
tion that the underlying network is independent, so throughout this section, we will 
work under the assumption G = Gq. We will find an 77 ^ G (0, 77 ) with the property 
that 

S^,n{i) > 0 for [ 0 , 77 ^], 
and we will then bound the error probability 

(109) {t(w 7 v) G [r]-, 00 ) | G = Gq} . 

The first step is based on bounding /3^ ( 7 ) for any 7 G (0, 77 ) from above. We will 
bound from above the probability of emission of a distribution in by (which 

is to say, (3^ ( 7 )) by the use of Sanov’s Theorem. According to Sanov’s Theorem, 
the key statistic for estimating this probability of emission is the KL-divergence 
niq^Wp"^) where g'*' is defined as follows. 

Definition 7. We let denote the element ofV which is the I-projection of 
p^ onto A°. 

Note that even though p^ has uniform marginals (by definition), q'^ does not 
necessarily have uniform marginals. We now list the facts that are used to bound 
H{q^\\p^), and with it (3^ (^): 

Lemma 20. For any p(t) and q{s) G Vk,i, with p G Vq and q G Vq and 
t G [0,tmax(p)), S G [0,fniax(g)), WC haVC 

||g(s) - p(t)||oo >{k + l + 1 )"^ \t-s\. 

Proof. The proof is quite elementary: see Sectionfor details. □ 

Recall the well-known: 

Lemma 21. Pinsker’s Inequality. Given any pair of distributions q^p GV, 

H{q\\p)>2\\q-p\\l. 

Proposition 18. Assume that k = I = 2 and 7 G ( 0 , 77 ), so that tfj < tf^ ■ 
Then 

log/S^" ( 7 ) < 1^1 log(N-f 1) - 
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Proof. First use Sanov’s Theorem, then Pinsker’s Inequality (Lemma[2T|) and 
finally Lemma [20j 


( 110 ) 


log/?^”( 7 ) < 1^1 log(iV + 1) - H{q<\\p^) ■ N 

< |X|log(iV + l)-2|!(7^-p''||^.iV 


< |X|log(iV + l)-2 


\ 7 V 


■ N. 


The reason for the coefficient ^ is that in the case k = I = 2, the constant 
{k + l + 1)“^ in Lemma 


20 


is i. 


□ 


In order to study the dependence of and in the special case k = I = 2, we 
are going to use Taylor’s Theorem with Remainder, which allows us to write out a 
function as a partial Taylor series (i.e. a polynomial) where the remaining terms 
are given (estimated) using the mean value theorem. We start with the following 
expression for the test statistic: 


Po,o 


t(p) = Po,o log ■ 

Pa,oPb,o 

Then for any t S R, 


■Po.i log ■ 


Po,i 


Pa,oPb,i 


-Pi,olog ■ 


Pi.o 


Pa,iPb,o 


-Pi,i log ■ 


Pi,! 


Pa,iPb,i 


I Ij-W ( , j-M PA,0PB,0 + t , ^ PA,0PB,l-t , 

T{P{t)) = (PA.OPBfi + tj log + [pAfiPBA - t) log -- h 

Pa,oPb,o Pa,oPb,i 

, ,N, Pa,oPb,o - t , / , ,N, Pa,iPb,i +1 

[pAAPBfi - t) log- h (PA.IPB,! +i)log-. 

Pa,iPb,o PaaPb,i 

Using Taylor’s Theorem to expand T(p(t)) for positive t around the base point 0 
we obtain: 


r(p(f)) = t(p(0)) 


dT{p{t)) 


dt 




id^pit)) 

2 


dt^ 


for some s S (0, t). 


Since the value of this function and its first derivative are both zero at the base 
point t = 0, this reduces to 
( 111 ) 


-W)) = 


1 5V(p(t)) 


OH 


for some s S (0, t) 


1 


+ ■ 


1 


1 


+ 


1 




yPA,0PB,0 + S PA,lPB,0-S PA,0PB,1-S PAAPB,1 + S ^ 

Lemma 22. For p(t) G 7 ^ 2,2 with uniform marginals, that is where p(0) is 
the uniform distribution, we have T{p{f)) = rj implies 

1 


P 


<tt< 


1 


Vv- 


2v^V 2r? + 1 “ “ 2 V 2 

Proof. By using the hypothesis that the marginals are uniform, 

1 

2 > 

1 1 


PA,o = Pb,o = 5 ) we have that (111 I specializes to 


Tip(.t) = -4 


4s- 1 


Define 


fis) = -4 


4s + 1 
1 


V, for some s € (0, t). 


1 


4s — 1 4s + 1 
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Figure 2. Graphs of /(s) and f'{s) from the proof of Lemma 22 
showing the monotonicity of / on the feasible interval. 


so that 

( 112 ) 


T{p{t)) = f{s) ■ for some s G (0,t). 


Claim. The function /(s) is increasing in (0,t). 
In order to prove the claim we compute that 


(113) 


32 


32 


(4s-1)2 (4s+ 1)2- 


The derivative (113) has no zeros in (0,t), because we can easily solve to find that 

32 


32 


= 0 s = 0. 


(4s-1)2 (4s+ 1)2 

Therefore, /(s) has no critical points in (0,t). Further, we can then show that the 
derivative (113) is positive for s G (0,t). 

Because of the Claim, and ( |112 ), we have 

f{Q)f < T{p{t)) < 


8t‘^ < T{p{t)) < -4 


8r <V< -4 


1 


1 


4t - 1 4t + 1 
1 1 


4t - 1 4t + 1 




Solving these inequalities for t we obtain the bounds in Lemma 22 


□ 


Because does not necessarily have uniform marginals, we need to derive a 
more general, though slightly weaker, version of the upper bound on t from Lemma 
In order to do so, note that when fc = Z = 2, for i,j such that 0 < i < k — 1 
and 0 < j < Z — I, we have 


PA,iPB,3 + (-I)*+'’s = p{s)ij 
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Substituting this equality in (111), we may write the Taylor expansion in the second 
line of ( 111 ) as 

(114) 


1,1 


T{p{t)) = d I H I 

0 .1=0 


Note that, because the numbers p{s)ij are the entries of the contingency table of 
a probability distribution p(s) G 7 ^ 2 , 2 , 

1.1 

= 1 - 

ij=0 

and there are k ■ I oi the numbers p{s)ij. Therefore, at least one of the p{s)ij is 
less than or equal to {k ■ l)~^ = Further, each of the terms in (114) is positive. 
So we have 


(115) 


1 


1,1 


11E > 5 ■ 4=2. 


0 . 1=0 


Lemma 23. For any q { t ) € 7^2,2, where g(0) G Vo (product distributions), we 
have T { q { t )) = 7 implies 


(116) 


Proof. By (114) and (115), 


Thus, 


t<Jl. 


T(p{t)) > 2f. 
7 > 2 t^. 


Solving for t , we obtain (116). 


□ 


Using Lemma 


22 


to bound t)l from below and and Lemma 23 to bound from 


above, in Proposition 18 we obtain the following upper bound on log/3^ . 


Lemma 24. Assume that 7 < ^ 2^+1 • 

log/?^” ( 7 ) < 1^1 log(7V + 1) - 


1 ? 


- V7 


25 V2 V 2?? + 1 

We can then use this bound on log /3^ ( 7 ) to bound the difference of objective 
functions. 


Lemma 25. Assume that 7 < ^ ■ Define 


(117) 

Then 


K := K — 


1^1 log (^) 

log 


( 118 ) (^l(ly/^-y7)U7)« + («'-|v|)iogJv. 
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Proof. By ( |47[ ), 

Sr,,N{l) = -Nj + K\og{N) - log/S^" ( 7 ) 

>-Nj + ^ logiN) - |X| log(iV + 1) - ^ -Vi) -N 

By dllTt , 

K\og{N) = k' log(A^) + |X| log(A^ + 1) ~ l-^l logiV. 

Substituting this into the above expression and gathering together the terms in¬ 
volving N and log TV yields the estimate (1181. □ 

In order to find 77 ^, what remains is to find conditions on 7 for the right-hand 
side of ( 118 1 to be positive. Conveniently, ( 118 ) amounts to a quadratic polynomial 
in ^/7 which we can solve analytically. 

Proposition 19. Assume N > 0 is fixed. Let n' be as defined in (1171 . 
Assume that 


(119) 

then if we define 


N > exp W 


V 



96{2r] + l){\X\- k') J ■ 
- 2400(|X - 


48 


> 0 , /or 7 e ( 0 , 77 ^). 


11 


144(277 + 1) ’ 


;N ■ 


Proof. By (1181, we have > 0 provided that 

2 


In order to make the formulas easier to manipulate, we set, for the purposes of this 
proof 

C:={\X\-k')^°^^ 


N 


y ■■= 


V 


and 


277 + 1 ’ 
X := v^. 


We substitute these definitions into (121) to see that we are seeking x for which 

1 /I 


-C>0. 

Expanding the squared difference and gathering terms according to the power of x, 
we see that this is equivalent to 


- lOOC > 0, 


-lOOcc^ +4 


\y-x 


2 
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that is, 

-96a:2 - Ayx + - lOOC) > 0 

Solving the corresponding quadratic equation, we obtain the solutions 

-Ay ± 4^251/2 - 2400(7 


X = 


192 


Since x = > 0, we are interested in the nonnegative solution 

(122) X = ^ +4v/2V-2400C 


192 


In order for (122) to define a real number, we must have positivity of the expression 


inside the radical, which is equivalent to 


^ 96’ 

which is equivalent to 

fl23) ^_ 

^ ’ N 96(27? + 1)(|X|-k0‘ 

Condition (1231 is satisfied as long as ( 119[) in the Proposition is satisfied. Returning 


to the condition (|122|), in terms of 7 , 77 , N, this condition says that 

V 


7 < 


2r)+l 


25.^-2400(|X-7c'|)i^ 


48 


Thus we can set 77 ^ equal to the right-side of this inequality. As TV —)• 00 , the 
subtracted term inside the second radical in the numerator approaches 0 , so that the 
entire numerator approaches A ^Thus the entire right-hand side approaches 

2 



144(277 -t l)’ 


□ 


It remains to carry out the second part of the strategy outlined at the beginning 


of Section 6.1 which is to bound the error probability (109). Set 5 equal to this 

{t(wjv) > Vn} ■ 


log 5 < IA| log(TV + 1) - H{r^- |bo)iV, 


error probability, which is to say 

A •— Pt 
u r Lujn-^pq 

Then Sanov’s Theorem says that 

(124) 
where by definition 

(125) Ibo) := inf {H{q\\po) \ Q & ■ 

More generally we can derive an estimate, for 7 > 0, of the minimum KL-divergence 
of an element in the set V-y based at po G Vo- 

Lemma 26. We have for any 7 > 0, 

H{Vy\\po) > 7- 
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Proof. For the purposes of this proof, for q S V-y, we let 
go denote the M-projection of g onto Vq- 
By definition of the M-projection, 

^(qIIpo) > ■f^(<7lko)- 

But H{q\\qQ) is the mutual information r(g) of q, and is defined as the set of 
q GV such that r(g) > 7 , so that H{q\\po) > 7 for all q S V^. So the above says 
that H[q\\po) > 7 for all q S Vo- By (125) this completes the proof. □ 

By applying Lemma 


26 


with 7 = ? 7 jy to (124), we obtain 


(126) log(5< |X|log(A^-bl)-ry^-fV, 

which allows us to prove the following estimate of the error probability. 

Proposition 20. Letpo & Vo- Let 5 > 0 be given. Assume that 

/ w 


(127) 

N > Tiyy 


Vn 

Then, 

^ ^ujn^Po 




jX|exp(j^) 
{r(a;w) > qjj} < 6. 


- 1 . 


Proof. By using (126), we see that we will have the claimed estimate on the 
error probability if 

1^1 


N — hlLi iog( + 1) > log S ^. 
Vn 


Solving this inequality for N, we obtain (127). 


□ 


Completion of Proof of Theorem [3l Part (a) consists of repetition (of 
the G independent case) from Theorem For part (b), combine Propositions [I^ 
and The only modification that has to be made is that in Proposition the 
parameter k' depends on N. In order to make the conditions on N explicit, we 
have chosen a /i S ( 0 , 1 ) such that 

logiV > 


Thus 


1^1 log ^ 

logiV 


< 


1^1 

log TV 


< KpL. 


Thus by (117), k! > k{ 1 — p). 

For part (c), the only asymptotic statement we have not yet proved is for . 
The expression responsible for the asympt otic s is the final element of the maximum, 
namely the one involving By Lemma 


29 








1 |X|exp(^) I \^N 


below, we know that 

= 0 1 Tiog (T 
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as rjpf, (5 —>■ 0“*'. (The rjj^ in the exponent in the denominator inside W just makes 
the exponentiated factor approach 1, and can be ignored). Further 

/ |X| \ 


— log 
Vn 




Vn 


1 


1^1 


— log — + — log 7 


Vn Vn 


So that 



But as TV —^ oo, converges to 77 , whose asymptotics are the same as the asymp¬ 
totics of e. □ 


Completion of Proof of Theorem m For the case of (G,P) = (Gi,pf) 
(dependent case) we use Proposition 10 in Section 4.2 Using the calculations of 
Section [ 4 .3| we transform the conditions of Proposition |l0| into the form seen here. 
For the case of (G,P) = (Go,Po) (independent case), we use the result of part (b) 
of Theorem The asymptotics are still controlled by the same bounds as in the 
proof of Theorem □ 


6.2. Case of n Nodes. The case Skel(G') 2 Skel(G) is the only one where 
the estimates we want to refine can arise, so throughout this subsection we are 
under the assumption Skel(G') ^ Skel(G). 

We prove a counterpart of Propositionjl^ which uses an estimate of (r(Y/v)) 
derived from Lemmain place of Corollary]^ That is we replace a bound ulti¬ 
mately derived from Chernoff’s Theorem with one ultimately derived from Sanov’s 
Theorem. First we formulate Lemma |24| in terms of the probability that drawing 
a sequence from an independent joint distribution leads to a sparsity boost with 
specified linear growth. 

Lemma 27. Let po S Po- 1 ^ \ 27^+1 ■ Then 

|-iog<(r(y^)) < ^ iV- |x|iog(iv + i)| 

is no greater than (TV + l)l^le“'''^. Let L > 0. Let Pip^ ■ ■ ■ ,Pl,o G Po (^ot nec¬ 
essarily distinct, but all product distributions) and suppose 1 < i < L are 

empirical sequences with Yi^fq consisting of N samples drawn i.i.d. fromptfi. Then 
the probability is no greater than 

L ■ (TV-f l)l^le-^^. 
that for any i, \ < i < L, we have 

- log {r{Yqq)) < 1 TV - |X| log(iV + 1). 

Proof. By Sanov’s Theorem, for every 7 > 0, 

(128) Prv„..p„ {r(F^) > 7 } < (iV + l)l^le-^^. 

Consider the events 


El ■■= {t{Yn) > 7 } 
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and the event 


i?2 := - log/?^" {t{Yn)) < -|X| log(iV + 1) + 1 ( 1 




By Lemma [Mj f or 7 satisfying the hypothesis, E 2 Q Ei. By this containment of 
events and (128) 


PrT„^Po(^^ 2 ) < Pry^^poiEi) < (iV + 

By using the definition of the event E 2 in this chain of inequalities, we obtain the 
first statement. For the second statement, we apply the union bound to the first 
statement. □ 

Proposition 21. Let C he a subset of 

{iA,B)€G'\G} 


of cardinality L. Let ifi be as in (95). Let 9 S (0,1) and let N' and N be positive 

1 V 

4 2 t ;+1 ■ 


integers such that N' = {1 — 6) ^mN. Let 7 < 77 ;^. With probability at least 


1 - - L{N + l)l^le-T'^, 

we have for all {A, B) G C, with S = S{{A, B), G) defined in Definition^ 


min — In 

sGVal(S) 


Pn> ir{Pu^^,,A,B\s)) 


11-0 1 

/ 

J " TO 25 



iV'-|X|log(iV+l). 


Consequently, 




y max min — In 





> 


1-9 L (\ 




m 25 

Proof. Take iV' > (1 - 9)~^rhN. Let 

lon' ~ P with Card(a;Ar') = N'. 

By Lemma [TTl with probability at least 

1 - 

we have that for all s G Val(S'), 

Card(YAr) = N, 

for yfi a sequence of observations meeting the criteria that the observations take 
the joint value of S' is s. 

Then we can apply Lemma Hjl to each one of the terms in the sum. Using 


Lemma 27 to estimate and the union bound, we conclude that for = Pnj 
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Card(yAr) = iV, and for 7 < 52 ^’ 


Pr 

A A o 


.G < min_ -ln/3^, (t(wjv',^, B|s)) > 


sGVal(S) 


1-6 1 
TO 25 


277+1 


-V7 


> 1 - (A^ + 


To complete the proof, we need a bound that holds for all [A, B) G £ simulta¬ 
neously, so we apply the union bound one more time, at the cost of a factor of 
L := card(£) multiplying the {N + term in the error probability. □ 

This allows us to deduce the most general form of the finite sample complexity 
for this case. 

Proposition 22. Under the hypotheses of the preceding Proposition \21\ with 
probability at least 


(129) 
we have 


1-Si- - L{N + l)l^le-^^. 


Sri{G, G', Wjv/) > 


1 - 6» L /I 


77 


T-^ - 


TV' 


m 25 \2 y 277 - 

- («(|G| - 77 ) - L\X\) log TV' - |iT(G)| log 0-1. 
As a consequence, provided that e, 7,77 satisfy 
1-9 L 


(130) 




777, 25 

there is a function N(r], e, 9, L, k, 0) such that 


For all N' > N{ri,e,9,L,K,Q), 5^(G, G', wat') > 0, with probability {129 \. 


Proof. The estimate for 5^(G, G 'follows immediately from Proposition 
21 except that we replace the subtracted term L|X|log(A^ + 1) with the larger 
subtracted term L\X\ log(A^'). The second statement follows from the first, because 
the linear term in the estimate dominates the logarithmic and constant terms for 
large TV'. □ 

This most general form of the hnite sample complexity has the usual three 
deficiencies: the statement does not make clear that the condition (130) can be 


satisfied; the error probability is stated in terms of functions involving N instead of 
being in terms of a confidence 1 — 5 chosen by the experimenter; and the asymptotics 
of N' are not immediately clear because we have not derived an explicit expression 
for N' in terms of the parameters. 

We address this by making a somewhat arbitrary, but adequate choice of e, 7 , 
in terms of 9. First, we know that we must have 


|X|log(TV + l) 
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in order for the estimate of Proposition to hold. In order to achieve a positive 
margin (which e is going to decrease) for the coefficient of N' in the estimate, we 
take, 7 to be a quarter of this upper bound, 

1 


(131) 


v/7 = 


V 

2r] + l 


Then the left-hand side of ( 130| ) is at least 

1-e L rj 
m 400 277 -I- 1 

So we choose e so that subtracting e only decreases the linear coefhcient by half: 

1-6 L r] 


(132) 


e = 


m 800 277 -I- 1 


As a result of the choices of 7 , and e in (1311 and (1321, we obtain that the left-hand 

1-6 L 77 


side of (130) is 


771 800 277 -I- 1 


> 0 , 


meaning that condition (1301 is satisfied. 

Completion of the proof of Theorem [H Each of the conditions preced¬ 
ing the bound corresponds to setting one of the subtracted terms of (1291 equal to 
I. In particular, we obtain the third bound by setting 7 = 1 ^ 2 ti+i accordance 
with (131). For N larger than the first three quantities in the statement of the 
Theorem, with probability at least 1 — <5, we have 


1-6 L 


V 


-AT- («;(|G| - 77 ) -L|A:|)logAf- |E;(G)|log0- 


Sjj{G,G', ujn) > . onn 9 1' 

777 800 277 -I-1 

Therefore, for N larger than the last quantity in the statement of the Theorem, 
Srj{G, G' ,ujm) > 0 with probability 1 — <5. □ 

7. Proofs of Technical Lemmas 

We have derived our asymptotic statements in the finite-sample complexity 
theorems with the aid of the following elementary lemma in the theory of the 
Lambert-W function. 

Lemma 28. With yV(x) := —W-i{—x), the function W has the following as¬ 
ymptotic behavior as a; —)■ 0 ^; 

Cn(— ln(a:))' 


(133) 


>V(a;) = — ln(a:) -f ln(— ln(T)) — O 


ln(a:) 


Proof. Start with the inequality 


1 exp(l/a:) , 1 

— < X < - - for-< a; < 0 

e X e 


Since W-i is the decreasing real branch of the Lambert-W function on [—e ^,0], 
by applying W-i to these inequalities we obtain 

1 < -lT_i(a:) < 


Taking logs. 


0 < ln(—lP_i(a:)) < — ln(—a;). 
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Further, by taking the natural log of the relation kF_i(x) exp(fF_i(x)) = we 
have the “quasi-recursive” identity for —W-i{x) 

—W-i{x) = — ln(x) -I- ln(—kF_i{a;)). 

Applying the previous two-sided inequalities to this we have 

1 < —W-i{x) < —21n(—a;). 

Taking the log again, 

0 < ln(—fF_i(a;)) < 2ln(—ln(—a;)). 

Applying this inequality to the right side of the “quasi-recursive” identity for 
—W-i{x) yields the asymptotic expression 

W-i{x) = ln(—a:) -|- 0(ln(— ln(—a;))) as a; —>■ 0“. 

Finally, we obtain the claimed asymptotic by substituting this identity again into 
the quasi-recursive identity. □ 


We recall a form of the big-0 notation appropriate to multivariable situations, 
which may be less familiar to some readers than the single-variable form: if x is 
a vector of variables xi, then we say that a function /(x) = 0(g(x)) as x —>■ 0*^ 
if there are constants C, M, such that provided all Xi > 0 and x~^ > M, we have 
/(x) < Cg{x). An analogous definition can be made for x —)■ oo. The following 
Lemma is an elaboration of Lemma which we can apply directly in the proofs 
of our finite sample complexity theorems: 


Lemma 29. With W{x) := —W-i(—x), we have the following asymptotic be¬ 
havior:. 

(a) yV{x) = 0(lna:“^) as x ^ 0+. 

(b) Let Xi i = 1,... fc be variables and denote the vector {xi)^ by x. Let 

fi < Q be non-positive integer exponents and Ci, C 2 real constants. We 
have 


C,l[xTw(c 2 l[x-^']=o(l[ xT Y. In 


1 


as X 


0 ^ 


(c) Let the Xi and x, as well as the Ci, fi < 0 and Cij,C 2 j be as in part (b). 
Let Xj be functions such that 






Let A = max(Aj). Then we have 


A=o 


j(ei.j) 




as X 


Proof. Part (a) follows immediately from Lemma 28 Part (b) follows from 
the observation that for fixed M > 0, if we set M' = Ai^ then if all Xi < 1 /M', 
the monomial x®’ < M. Note that we can replace the sum on the right-hand 
side of (b) with a product simply because for x~^ large all the terms In ^ >0. 
With this observation, it is not difficult to prove part (b) from part (c) using (for 
example) induction on the number of functions Xj. □ 
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An almost identical Lemma can be formulated for the case of a; —>■ oo, whose 
statement and proof we leave to the reader. 


Proof of Lemma [20J First, use 

i-i 

qA,o 

3=0 

1-1 


k-1 

9B,o = 

k-1 


PAfi = PBfl = '^Pz,o{s)- 

3=0 


2 = 0 


Thus, 


kA,0 -PA,o\ < l\\p{t) - g(s)||oo 
\qB,0 - PB,o\ < k\\p{t) - g(s)||oo- 

So we compute that 

\qA,oqB.O — PA,OPB,o\ = \qA,oqB ,0 — qA, 0 PB ,0 + qA.OPB.O — PAfiPB,o\ 
( 134 ) < qA,o\qB,0 —PB,o\ +^ 5 , 019^,0 — PA,o\ 

< {k + l)\\p{t) -g(s)||oo 

Next note that 

t = P{t)o,0 - PO.o(O) = p{t)o,0 - PAfiPBfi 
s = g(s)o,0 - 9O,o(0) = q{s)o.p - qA,oqB,0- 

Thus 

- s| < b(t)o,0 - g(i)o,o| + \qA,oqB,0 - PA.OPB.ol 


< \\p{t) - g(s)||oo + {k + l)\\p{t) - g(s)||oo <ik + l + l)lb(t) - g(s)||oo, 
where we have obtained the next-to-last inequality by using definition of the norm 


\\p{t) - g(s)||oo and (134). 


□ 


2 PN 
P 




into two cases, pN > P 


Proof of Lemma [2l We divide the event | 
and Pn < Pj handle both with the appropriate form of the Chernoff inequality, and 
then apply the union bound. First, if pN > p, then 


1 _ M 

P 


= ^ - 1 
P 


By the multiplicative Chernoff bound, first part 


Pr ^ - 1 > e !> < 
P 


Second, if < P, then 


1 _ M 

P 


= 1 - ^ 
P 


By the multiplicative Chernoff bound, second part 


Pr <) 1 _ ^ > e 1. < 

P 


Thus 


Pr 


1 - 


Pn 


>e\< 
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CHAPTER 3 


Computation of ^^-values 


We now explain the main computational part of this work, which is the com¬ 
puter calculation and tabulation of (estimates for) error probabilities . We 
describe in concrete terms not only the theory behind this computation but also 
how it is implemented in Python/Numpy/Scipy code in the packages betaMonte- 
Carlo.src and betaMonteCarlo.test. We have made the source of these packages 
Dubliclv available at |Bre m 

Notation. We recall the following notation from earlier in the paper or from 
the standard theory of types. For N > 0 and an alphabet of symbols X, we let 
Tn denote the type classes of size N over X. That is, Tn is the set of outcomes 
of N independent draws with replacement of elements from X, considered without 
respect to order in the sequence of draws. An element T € Tn can be identified with 
an |X|-vector whose entries are nonnegative integers summing to N. The origin 
of the terminology is the fact that the map associating each empirical sequence 
ujn € X^ with the type class, by taking frequency counts, is a partitioning of X^ 
into disjoint equivalence classes. Thus, we can write lun G T to indicate that ujn 
belongs to the type class T. 

Denote by pt the probability distribution in V = 'P\x\ associated to type T, 
defined as the distribution whose parameters are the entries of T divided by N. For 
p € 7^, 7 > 0, is defined as in Section]^ namely, 

(135) /3^(7) = < 7 } ■ 

For a set A C V, 1a denotes the characteristic (indicator) function of A. We use 
1^(7") to denote the value 1a{pt)- Note that we clearly have 

(136) t{p^^) = t{pt) for ujn € T. 

Along the same lines, we can expand the emission probability of T S 7)v, as 

(137) Pr,(T) = ^ Prp(a;7v). 

CJiV £7" 

For an in-depth treatment of the Theory of Types see, e.g., Chapter 11 in [CT06] 1. 

In this chapter we will denote p^ whenever possible simply by p. This not only 
reduces notational clutter, but also reminds the reader that the majority of the 
arguments apply to any p G V, not only to distributions with uniform marginals. 

1. Overview of Methods 

In our computation of Pxi'j), we carry out two computationally distinct tasks: 
(1) Prior to the running of the structure learning algorithm, compute (offline) 
estimates of P^il) for a reasonably large number (on the order of 10 ^) of 
pairs of N, 7 , and we store the results in a data structure/file. 
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(2) In the deployed version of the learning algorithm, we call a function that 
first looks up values of /3 in the tables produced in the first step, in a neigh¬ 
borhood of the requested N and 7, and then we use linear interpolation 
to return a result approximating /3^(7). 


Each of these components raises its own issues, which are actually intertwined. For 
the offline computation, the main issues are how to compute in a way that trades 
off precision of the result with speed/tractibility in a quantifiable manner. For the 
lookup/interpolation step, the main issue is precisely how to carry out the linear 
interpolation, that is on which statistics derived from N, 7,77, to linearly interpolate. 
The answers to these questions determine the makeup of the grid of N, 7 for which 
the computations in the first step should be carried out and stored. 

Because of the functioning of each step in the overall algorithm, each step has 
different practical constraints. The first step can be as slow as necessary to achieve 
the desired accuracy, but the second step must be very fast so as not to impede 
the learning algorithm. These consideration must be balanced against one another. 
On the one hand, by making the table in the first step larger, we make it easier to 
obtain accurate estimates of /3^(7) for the full range of iV, 7. On the other hand, 
we do not want to make the table too large, because we want the table to be able 
to fit into memory (RAM) in order to make the second step as fast as possible. We 
will find these issues are much easier to address after we have explored the results 
of calculating /I’s for strategically chosen pairs of N, 7. 

In more detail, for the first task, our computation of an individual PM, 
we follow, up to a certain point, well known and accepted patterns in numerical 
methods. First (Section]^, we give an algorithm for the exact computation of 
/3^(7), using its definition as the cdf of a certain discrete random variable. Second, 
(Section |3.I[ ), we replace the original definition of /3 as combinatorial object (sum 
of many different, small probabilities) with the integral of a continuous function in 
R^, approximating the original /3. Having made this replacement, we can introduce 
the standard method of integration of a continuous function over a domain in 
R” by “Monte Carlo” sampling, and specify (Section 3.2) a particular scheme for 
determining a stopping criterion for the Monte Carlo integration. At this point, 
the problem becomes how to choose a probability distribution for the Monte Carlo 
sampling which reduces the variance of the approximations and allows the stopping 
criterion to be satisfied in as few iterations as possible. We explain (Section 3.3) 
how we can achieve this goal via the strategy of importance sampling, which uses 
the information we already have concerning the integrand to construct the sampling 
distribution. In order to carry out the importance sampling strategy, we have to 
use some features which are particular to the situation at hand. Namely, in this 
situation, the selection of a point from R^ according to the “importance-sampling” 
probability distribution is equivalent to picking a distribution po(0 & ^ 2.2 according 
to a weighting scheme, with some regions of the space 7 ^ 2,2 weighted much more 
heavily than the rest, in a way based on the integrand. We break down this selection 
into two parts: first, the selection of the marginals po,A, Po,b and second, the 
selection of the t-parameter. We explain how our knowledge of the integrand guides 
each part, covering the weighting scheme for the marginals in Section |3.4| and our 
weighting scheme for the t-parameter in Section [3.5[ 
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2. Exact Computation of the Entire CDF 

We develop a method of computing the function exactly. The reason this 

is feasible is that is the CDF of a certain discrete random variable, namely the 
random variable which maps wjv, in the finite outcome space , to the mutual 
information t{ujn)- As a result, /3^(-) has a very simple form as a function: it is an 
increasing “step function”, with a certain finite number of jump discontinuities on a 
fixed interval between 0 and the maximum value of r(p) for p S Pk,i- This allows us 
to store 0%{-) concisely in a data structure that keeps track only of the discontinuity 
points 7 and the values of the CDF at those discontinuity points. For this 

purpose, a dictionary (Python for “hash map”) is a suitable data structure. It is 
convenient to define a Python class called CDF, such that every CDF object has 
such a dictionary as its main data structure, and also several methods as part of 
its API, as we will describe below. 


2 . 1 . 


Serial Computation. By (136) and (137) we can express defined 


as in (135), as follows. 


(138) 


I^Nh) = Y. 

TGTjv 


This equation allows us to develop a practical method for computing the entire 
function /3^(-) exactly. 

Any such method must have two main components: first, a method for iterating 
over all the types of a fixed size N and length n = |X|; second, a method for 
processing a given type. We first describe our method for iterating over the types, 
which is one of only one of many possible methods. The advantage of our method 
is that it makes it easy to implement a parallelized version of the algorithm, as 
we will see in Section |2.2| below. Then we will explain our method of processing 
each type, which in this case amounts to accounting for the type in the CDF. The 
advantage of separating out the processing method from the iteration method is 
that it allows us to “plug in” various other processing methods, as we will do in 
Section 12.21 below. 

In order to describe the iteration method, consider the list of prefixes of T G Tn- 
For example, the list of prefixes of the type (0120) G of length 4 consists of the 5 
prefixes [(), (0), (01), (012), (0120)]. We can think of all the prefixes of all elements 
of T/v as being arranged in a rooted tree structure of height n + 1, where the root 
of the tree is the empty prefix, the prefixes of length m are located at level m + 1, 
and the parent of a prefix of length m + 1 is its prefix of length m. Further, the 
elements of 7)v reside in the leaf level of the tree. With this definition of the tree, 
the path from the root down to T G T^v in level n + 1 (e.g., (0120)) passes through 
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the n + 1 prefixes of the type (e.g., [(), (0), (01), (012), (0120)]). In Figure 2.1 
have shown the entire tree of prefixes for the case T 3 and n = \X\ = A. 


we 


Once we have set up the tree, we can specify a method of iterating over Tn by 
giving any method of tree traversal. For reasons which will be seen below in Section 

class and recursive DFT processing of the tree nodes: 

1 
3 
5 

9 
11 
13 
15 
17 
19 
21 
23 
25 
27 
29 


class TypePrefix 


:?^Constructor 

def typePrefix (N, data = [] , n = 4): 

New typePrefix with 
size := N 
length := n, 

data := data, where len(data) <= n sum (data) <= N 
processed := False 


^T^Methods 

def boolean hasChildren(): 
return len(data) < n 


def gener at or <t y peP r ef ix > chi 1 d r enGene r at or ( ) : 
r := N — sum(data) 
if len(data) < n — 1: 

return g e n e r at o r ( t y p e P r e f i x (N, d at a + [ i ] ,n) for i in 
else: 

return generator( typePref i x (N, data + [r],n) for i 

range ( r + 1) ) 

n range(l)) 

def void DFSProcess ( processMethod ) : 
if hasChildren () : 

for child in c h i 1 d r e n G e n e r at o r ( ) : 
if not child, processed: 

child . DFSProcess ( processMethod ) 

else: 

processMethod (data) 
processed := True 



2.2 we use depth-first traversal (DFT). We present pseudocode for the TypePrefix 


Because only the leaves of the tree are true elements of Tiv, it is only at the leaf level 
that any actual processing is done: for internal nodes, which correspond to proper 
prefixes of types, processing the node means processing the node’s children. In our 
implementation, we have the flexibility to pass in to the method DFSProcess any 
method processMethod which takes a TypePrefix as input. 

It will be easier to explain the processMethod we use in the context of the 
API of the CDF object. A minimal implementation of the CDF object must have 
the following methods: 

• setReferenceDistribution(eta) Set the underlying distribution to be 
p** with reference to which the emission probabilities of types are calcu¬ 
lated. 

• accountForEvent (probability,value) Recall that the CDF object stores 
a representation of a locally constant, monotone increasing function, with 
finitely many discontinuities. This method modifies the CDF function by 
adding the step function probability x l{>vaiue} to the CDF function. 
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Figure 2. Running time of exact calculations (serial implemen¬ 
tation) and Monte Carlo estimations of 0^ for a pair of binary 
marginal random variables, i.e., n = \X\ = 4, as a function of 
N. Log-log plot shows the approximately quartic dependence of 
running time on N. 


assignCumulativeProbability (value) 
function at value. 


returns the value of the CDF 


The implementation of these methods will depend on the data structures used inter¬ 
nally by the CDF object to store the CDF function. We have used a sorted list of 
discontinuity values and a dictionary of pairs (discontinuity, CDF(discontinuity)), 
because this is the easiest to implement, but more space/time efficient implemen¬ 
tations are certainly possible. 

In this framework, we can use (1381 to see how to define the processMethod 
which will give us the exact CDF: 

1 def accountForType (T) : #T a type class 

p := r e f e r e n c e D i s t r i b u t i o n 
3 P-T := empirical distribution of T 

j accountForEvent (tau (p_T) , Pr_p(T)) #Pr_p(T) emission probability of T 


In order to compute the entire exact CDF, we make calls to the driver method 

accountForTypesWithPrefix of the CDF class: 

def account For Ty pesWit hP ref ix ( aTy pePrefix ) : 

2 aTypePrefix. DFSProcess (accountForType ) 

4 def accountFor AllTypes ( s e 1 f ) : 

rootPrefix = t y p e P r e f i x (N, [ ], n) 

6 accountF or Ty p es Wit hP refix ( rootPrefix ) 


The method accountForTypesWithPrefix will be useful in the parallelization. 
Section 12.21 

Since the number oftypes |7)v| is exactly the running time ofthe exact 

calculation algorithm is D(iV"“^). In our experiments, Figure [2T| the algorithm for 
the case of n = |X| = 4, appears to run in at least quartic time. It might be possible 
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to wring out some improvement in the running time by further optimizations to 
the serial algorithm, but it is much easier to achieve a speed-up through a simple 
parallelization of the algorithm. 

2.2. Parallel Computation. The idea of the parallel computation is roughly 
to break down the task of computing the CDF by “branch” of the tree of type 
prefixes, where a branch is a subtree having a single type prefix as common ancestor. 
For each branch, we form a separate process. The processes are distributed out to 
different cores and each process returns a CDF object accounting for the types 
in that branch. When all the processes have returned their CDF objects, we 
merge (add) the CDFs into a single CDF, accounting for all the types. Although 
the computation is not embarrassingly parallel, the parallelization scheme we have 
implemented agrees closely with the MapReduce paradigm |DG08| . 

In order to carry out the parallelization, we need a few auxiliary methods: 

• typePrefix.childrenListLastEntry k Mod m(k,m): return a list of 
the children of a typePrefix whose last (right-most) entry is congruent to 
k modulo m. 

• accountForListOfTypePrefixes(typePrefixes[ ], N, n, refDistri- 
bution). Returns a new CDF object accounting for all the types which 
are descended from any of the prefixes contained in the given list of type- 
Prefixes. 

• CDF.mergelnto(anotherCDF) Merges anotherCDF into the given CDF 
object, producing a CDF whose function is the sum of the functions of 
the given CDF and the other CDF. 

The first two methods are simple generalizations of the procedures childrenGener- 
ator and accountForTypesWithPrefix, for which we have already given pseu¬ 
docode above. The implementation of CDF.mergelnto(anotherCDF) is depen¬ 
dent on the representation of the CDF function internal to the CDF object, so we 
will not give pseudocode for it here. 

All that remains is to give pseudocode for a “driver” function that carries out 
the parallelization strategy: 

def void acco u n t F o r A11 Ty p es P ar a 11 e 1 iz e d (m) : modulus 

rootPrefix = t y p e P r e f ix (N, [] , n) 

rootsForParallelJobs = [ r o o t P r e f i x . c h i 1 d r e nL is t L as t E n t r y _k _M o d _m ( k , m) for k 

in range ( modulus ) ] 

^Mapper step : distribute to collection of nodes 

ListOfCDFs = [accountForListOfTypePrefixes(aRootList ,N, n,refDistribution) 

for aRootList in rootsForParallelJobs] 

^Reducer step 

for CDF in ListOfCDFs: 

mergeinto (CDF) 


Because the mapper step is embarrassingly parallelizable, we can use the Paral¬ 
lel... delayed idiom of the Python joblib library to parallelize the “mapper” step 
and to farm out each one of the accountForListOfTypePrefixes jobs to one of 
the available cores of the processor. With this very simple implementation, we were 
able to calculate the entire CDF for the case N = 200 on an eight-core processor 
in roughly 2 hours. If we wished instead to use a cluster of distributed proces¬ 
sors, we might use a MapReduce implementation such as Hadoop, and then the 
call to accountForListOfTypePrefixes would constitute the action of each of 
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the Mappers. The loop closing the pseudocode would constitute the action of the 
Reducer. 

The colors of the branches in Figure [^illustrate this parallelization procedure 
in the toy-example case of fV = 3, n = 4. Each of the colors indicates those types 
which fall under a given element of the list rootsForParallelJobs, provided we 
take the modulus m > 4. For this very small example, there is a severe imbalance 
between the number of types in each branch, with the branch under the prefix (0) 
containing ten times as many types as the branch under the prefix (3). In the 
case when both N and the modulus m are significantly larger than the number of 
cores, which is normally the case, this problem essentially resolves itself because, 
whenever a node happens to be assigned a branch with very few types, the node 
will finish its computation early and will then be free to be assigned another branch 
to work on. 


3. Monte Carlo Integration 


3.1. Replacing Summation -with Integration. We will begin by address¬ 
ing the question of how to estimate /?^(7), in a way that is scalable to several 
thousand pairs of N, 7, with some N in the range of 10^ to 10®. In order to obtain 
a scalable method of computing /3, we would like to find a way to apply stan¬ 
dard techniques of numerical integration such as quadrature (e.g., Riemann sums) 
and Monte Carlo Integration, because one can run such methods iteratively with a 
stopping criteria and obtain estimates that are (provably) good enough for certain 
purposes. Our first significant step is to replace the sum in (1381 with an integral 


approximating the sum. Naturally, the region of integration for this integral is A^. 
The choice of integrand is based on the following basic estimate in information 
theory. 

Lemma 30. Let T S Tjv have the property that {T]sr)i yf 0 for i = 1,..., \X\; 
equivalently, the associated probability distribution px G V assigns nonzero proba¬ 
bility to each atomic event {w}, for uj G X. For some 0 < 'd(T) < 1, we have 

(139) 

Prp(T) = eM-NHipT\\p) - d{T)\X\/12) ■ (2^7V)-(l^l-i)/2 j [](pt)..j 

\ ij 

Consequently, we have the upper bound 

(140) Prp{T) < exp{-NH{pT\\p)) ■ (27riV)“(l^l“^)/^ I J|(pT)ij 

\ i,3 

Proof. It is well known (and follows immediately from Theorem 11.1.2 of 
|CT06| 1 that 

Prp(r) = |r| exp(-iV(R(pT) + HiprWp))), 
where |T| is the cardinality of the type class T. From Robbins’ sharpening of 
Stirling’s formula, it is possible to find an estimate for log |r| (see p. 39 of )CK11| 
for details). Exponentiating Robbins’ formulation we obtain: 


|T| = exp(7VR(pT) - ^T)\X\/12) ■ (27r7V)-(l^l-i)/2 [](p^), 
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for some i?(T) e [0,1]. We use this expression, under the assumption that (Tjv)^ ^ 0 
for i = 1,..., |X|, or equivalently, that PT{a) > 0 for every a G X, to obtain (139). 
(Note that the definition of log used in |CK1H is log 2 , i.e., number of bits, which 
explains the appearance of the Ine which does not appear in our formula). As 
an immediate consequence of the nonpositivity of —'d{T)\X\/\2, we then obtain 

(fTiol). □ 


We use the right-hand side of (140) to define a function I “extending” Prp(-) 
to all oiV. Namely, we set 

B-={p& v\ p{{uj}) > 0 V w G a:}. 


For q ^ B, = 0, so the factor on the right-hand side of ( |140[ ) which is 

the reciprocal of the square root of this product is oo. Thus, in our definition of /, 
we have to account specially for points q GV — B. For any g G P, we set 
(141) 

'O \iqGV - B, q(^pr^ 

/(g) := <1 Prp(g) \iqGV - B, qGpr^ 


(n,j 


exp(—A^i/(g||p)) • {2ttN) 

By Lemma / has the following property: 

(142) / > Prp where they are both defined, namely onpj-j^. 


otherwise, i.e., if g G il. 


Also, since the interior Int('P) C B, and since the third line of (141) is a continuous 
function of g, we have 


(143) 


/ is continuous on lnt(7^). 


The practical advantage of /(•) over Prp(-) is that I is defined on every point of 
lnt(7^), not just the discrete set of points ptn corresponding to Tat- Attempts to 
extend the emission probability “directly” (or using some sort of interpolation) to 
the whole probability simplex entail complicated and slow and error-prone rounding 
and truncation procedures. Since we are concerned with obtaining a numerical 
approximation to quantities based on Prp(-), we derive a considerable advantage to 
use a continuous approximation such as /(•), instead of Prp(-) itself. 

Because of the property ( |142[ ), we have 

(144) (3M= E P^p{T)lA2iPT)< E ^(pt) 1 aj(/'t), 

TGTn TgTn 


where 1^^ is the characteristic function of the set Aq = {p G V 
Lemma 30 we only have that I provides an upper bound for Pr. 


P’ 


t(p) < 7}- By 
but we have no 

information on how tight this upper bound is. In order to establish that the bound 
is tight for large N, we have carried out both summations in ( 144 ), in the script 
scripts.robbinsEstimationAccuracy.py, and reported the results in Figure |3T| 
The CDFs graphed in Figure 3.1 show that the bound in (144) becomes tight for 
large N, particularly for 7 < 77 , which is the region on which the approximation 
has to be most accurate for the purposes of our algorithm. So we may write 


/3a( 7 ) « E 3"(/'t)1as(pt) for large iV. 

TgTjv 














ftra' (7) l 3 !o (7) 
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0.1 

Figure 3. Values of /3^ calculated by summation over all type 
classes, using the exact probability function PrpO.i and also using 
the estimating function I evaluated on the points ptn- Results, 
shown for N = 50,100,150, 200, essentially show the Ll-error of I 
as an approximation of Pr^o i, on Ag, for all feasible 7 simultane¬ 
ously. 


Now extend I from 7^ to a function I on all of R" by 

0 otherwise. 


The properties (142) and (143) hold for I since they hold for I. Let A C V and 
let dq denote the ordinary Lebesgue measure. Then by (142) and the properties of 
Riemann sums, we have 


N- ^ /(pt)Ia(pt) I{q)dq as N 


00. 


TgTn 
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Since dim(7^) = |X| — 1, we have 

/(p'r)lA(PT) ~ f I{q)dq for large N. 

TgTn 

Putting the previous two approximations together, we have 

(145) Pn^'I) ~ f I{q)dq for large N. 

Ja° 

More explicitly, 

[ I{q) dq = • [ exp{-NH{q\\p)) ■ (2TiV)-(l^l-i)/2 | [](g) 

^■40 Ja- 

= (iV/2^)(l^l-i)/2 eM-NH{q\\p)) 

= J^exp{-NH{q\\p)) 1 a « 

The numerical evaluation of this integral breaks down into two distinct tasks. 

(1) Coding the integrand as a function 

(2) Writing a function that integrates such a function over V, using Monte 
Carlo integration based on importance sampling with a standard stopping 
criterion. 

For task (1), we form an object of type emissionProbabilityCalculator, param¬ 
eterized by the variables {k, I, rj, N), which remain fixed throughout the compu¬ 
tation. In our software, the method RobbinsEstimateOfEmissionProbability- 
TimesCharFunctionOfTauMinusGamma takes the marginals and t-parameter 
oi p GV and returns the value of the integrand at p. 

Task (2), the integration, is made up of two subtasks. The first is implementing 
a general procedure for MonteCarlo integration. The procedure iterates until the 
measured variance is “small”. By “small” we mean small enough that we, for a fixed 
percentage accuracy, say PrecPerc, close to 0 and a confidence, say Conf, close 
to 1, we conclude that the result is PrecPerc percent accurate, with confidence 
Conf. For the first subtask we implement a standard scheme following Section 


4.5 of |Buc04| ■ as decribed in Section 3.2 below. The second subtask is choosing a 
probability distribution on the space V for the MonteCarlo integration, based on our 
prior knowledge of the integrand, that reduces the variance and induces convergence 
in as few Monte Carlo iterations as possible. This part, which is explained in Section 
3.3| and is by its nature particular to the situation of the particular integrand at 
hand, is essentially our novel contribution. 


3.2. Monte Carlo Integration with stopping criterion. We now give a 
high-level view in pseudocode of the MonteCarloIntegrate procedure. As input 
the procedure takes the following parameters, to be specified below in more detail 
relevant to our situation: 

• f, the integrand, a function on V. In the application f is I ■ lyij- 
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• pdf, a nonnegative function on V representing a probability distribution 
absolutely continuous with respect to Lebesgue measure on V. 

• Maxit, the maximum number of iterations, a positive integer. 

• freqRec, a frequency of recording intermediate results, a positive integer. 

• freqCrit, a frequency of testing for the stopping criterion, a positive 
integer 

• PrecPerc, the desired precision percent of the result, a small integer, 
typically 10. 

• Conf, the desired confidence level in the precision of the result, a float 
near 1, typically 0.95. 

As output, the procedure has the lists it Vais and rhoHatVals such that for each 
index i> 0 , the following hold: 

• rhoHatVals[i] is the Monte Carlo estimate of the integral at iteration 
it Vais [i] and 

• rhoHatVals[-l], the last value Monte Carlo estimate in rhoHatVals, is 
within PrecPerc percent of the true value of the integral with probability 

Conf. 


The latter is a heuristic statement rather than a guarantee: it must be verified 
empirically (see below). The framework procedure has the following pseudocode: 


1 

def MonteCarloIntegrate(f, pdf, M ax It, freqRec, freqStop, PrecPerc, Conf): 


t_Conf := two-sided quantile of the standard Gaussian random variable at Conf 

3 

K := (t_Conf*100/PrecPerc) '2 

5 

rhoHatVals := [] 


FHatValues := [] 

7 

itVals := [] 


s :=0 

9 

o 

II 

11 

Iterate until stopping criterion or Maxit iterations has been reached: 

13 

Draw distribution q according to pdf 


s-summand := f(q)/pdf(q) 

15 

F_s_summand := s_summand*2 


s += s-summand 

17 

F_s += F_s_summand 


Once every freqRec iterations : 

19 

I := s/(Number of iteration) 


rhoHatVals . append ( I ) 

21 

itVals . append( Number of iteration) 


Once every freqStop iterations : 

23 

F := F_s/(Number of Iterations) 


i f Number of Iterations >= K*(F/I'2 — 1 ): 

25 

Invoke Stopping Criterion and break from loop 

27 

return itVals , rhoHatVals 


A few comments on the procedure MonteCarloIntegrate are in order: the 
two-sided quantile of the standard Gaussian random variable at Conf is defined 
as the unique value ty satisfying P[\Z\ < ty) = Conf for Z the standard Gaussian 
random variable. In Python/Scipy this can be computed easily by 

1 from scipy import stats 

t_y = stats, norm . isf((l—y) / 2.0) 


and in our application y is Conf. For example, for Conf at 0.90, we have tconf ~ 
1.644, for Conf at 0.95, we have <conf ~ 1.959, and for Conf at 0.99 we have 
iconf« 2.57582. 
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The stopping criterion we use is based on a heuristic rather than a theorem. 
To explain this heuristic, let p be the true value of an integral which we are 
computing by the Monte Carlo method, and let p be the empirical estimate for 
p achieved after a certain number of iterations. Thus, the random variable p — p 
is the empirical error in the Monte Carlo estimate of the integral. For 
our stopping criterion, we would like to prove that if a certain condition is satisfied, 
then we have 

(146) Pr(|,i-H<f^^)>Co„f. 


Under the assumption that p — p is a Gaussian r.v. centered at zero and with 
standard deviation ^JWa.v{p), it is not difficult to prove the relation (see, e.g., (4.6) 
in |Buc04| l 


Var(/3) = 




Number of iteration’ 

where F is the quantity calculated in the procedure MonteCarloIntegrate. Using 
some simple algebraic manipulations (see p. 71 of [Buc04] l we can obtain from 
this relation that (146 1 is satisfied provided that the condition 


Number of Iterations >= K*(F/I"2 — 1 ) 


on line 24 of the pseudocode for MonteCarloIntegrate is satisfied. 

As for the assumption that p — p is has a Gaussian distribution, we do not 
have any a priori reason to believe this assumption is satished in our situation. 
We nevertheless use the stopping criterion outlined above, because it is simple to 
implement. The justification for using the stopping criterion we have adopted in 
MonteCarloIntegrate is an empirical one, namely that, for large N at least, by 
using this stopping criterion with PrecPerc equal to 10 and Conf equal to 0.95, 
we do achieve estimates to the true values of /? that are well within the 10 % error 
margin in 95% of cases. We have recorded the empirical results supporting these 
assertions in Figure These empirical results do not validate the assumption on 
p — p per se, but they do suggest that a Gaussian distribution models p — p well 
enough in practice that our entire procedure produces estimates p that are within 
PrecPerc percent of p at least Conf of the time. 

In the procedure, it may make sense to use the uniform distribution as the 
sampling distribution if, for example, we know nothing about the integral besides 
its integrability. In the case of the integrand /, because only a very small part V 
lies in especially for small 7 , it is an extremely rare event that a point chosen 
by the uniform distribution contributes to s anything at all, much less, anything 
significant. As a result, for a random, very small proportion of the iterations, the 
integral estimate rhoHatVals will record a (relatively speaking) very large jump. 
Further, even less frequently will a point of V chosen randomly according to the 
uniform distribution have test statistic r less than 7 and simultaneously lie close 
to the Tprojection of onto A^. For N large, such points account for most of the 
mass of the integrand (as can be seen qualitatively from Sanov’s Theorem), so such 
a small proportion of the iterations should lead to even more significant jumps in 
rhoHatVals. The result is that it will take very many iterations for the variance 
of the estimates to descrease enough for the stopping criterion to be satisfied. 
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7 7 


N 

0.001 

0.005 

N 

0.001 

0.005 

20 

12.20% 

29.15% 

70 

1.0.3% 

3.59% 

.30 

24.65% 

1.18% 

80 

9.06% 

4.1.3% 

40 

39.77% 

7.07% 

90 

0.77% 

4.05% 

.50 

2.45% 

4.887o 

100 

1.01% 

0.03% 

60 

3.52% 

0.30% 

110 

2.27% 

3.01% 


Figure 4. Illustrations of the effectiveness of Monte Carlo rnethod 
with stopping criterion at producing estimates /3^ ( 7 ) of ( 7 ) 

within 10% of the true value. Top row: improvement in the esti¬ 
mates of /3^ (0.001) and /3^ (0.005) as N grows from 10 to 110. 

Bottom left: for N fixed at 200 improvement in accuracy of esti- 
0.01 

mated ( 7 ) 7 O'*". Bottom right: Multiplicative error 

\f^N ~ \/l^N ^ varies from 20 to 110 . 


These problems with uniform sampling affect most markedly the cases of small 
7 and large N. From the point of view of the application of the algorithm these 
cases are of the most interest to us, so we must address them. Fortunately, in the 
above Monte Carlo algorithm, we can use a nonuniform sampling distribution pdf 
tailored to reduce the variance of the estimates of the integral of /. Such a variance 
reduction method is called an “importance sampling” scheme. 

3.3. Importance Sampling Overview. What now remains to explain is 
the part of the calculation that is the most delicate and potentially open to further 
refinement: the choice of the measure for the Monte Carlo sampling. In order to 
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explain this, we first specialize to the case oi k = I = 2, for which the map F from 
the marginal-t coordinates to the contingency-table coordinates is defined explicitly 
by 


F ■ {PA,PB,t) i-A 


PAPB + t 

{1-Pa)pb - t 


Pl.l Pl.2 

_Pa(1 - Pb) - i 

* 


P2.1 * 


The map F is bijective from to the copy of embedded in the upper-left 
corner of the space of 2-by-2 contingency tables. (The lower right coordinate is 
a function of the first three coordinates, namely * = 1 — pi i — pi 2 — P 2 ,i)- The 
inverse of the bijection F is given explicitly by 


ic-i 


Pi.i 

P2,l 


Pi,2 
* 


i-A 


Pl,l +P2,1 
Pl,l +Pl,2 

Fi,i - (pit +P2.i)(P1,1 



The mapping F in effect provides a change of coordinates map on V C R^. On 
the one hand, it is more convenient to think of the domain V as defined in terms 
of equations in the coordinates pip, Pi. 2 , P 2 ,i in R^, since in these coordinates V 
is simply the unit cube. On the other hand, it is more convenient to think to 
define the pdf of the sampling distribution in terms of the coordinates pa, Pb^ 
because of our knowledge of the integrand / and the way we can use the theory of 
Large Deviations to extract information about /. We will explain the procedure in 
detail in Sections 3.4 and 3.5 below. First, we validate this procedure through the 
following lemma. 


Lemma 31. The Jacobian JJF of the bijective mapping F is unimodular. Con¬ 
sequently, the Jacobian fJF~^ of the inverse mapping F~^ is unimodular. 


Proof. It is a straightforward calculation, from the partial deriviates of F 
taken in the order ^ that 

( Pb 1 - Pb -PB \ 

PA -PA 1 - PA • 

1-1 -1 J 

(The Jacobian fJF{pA,PB,t) turns out to be constant in the variable t.) From this 
one computes that det JF{pA,PB,t) = 1, as claimed. Therefore, 

detJF~^ (pi,i,Pi, 2 ,P 2 .i) = detJF~^ {F{pA,PB,t)) = (det JF{pA,PB,t))~'^ = 1, 
also, as claimed. □ 


The significance of Lemma [3T] is as follows: if we denote the Lebesgue measure 
of R^ with respect to the coordinates dpipdpi_ 2 dp 2 ,i and define ^ to be a measure 
with density function <i> with respect to the {pA,PB,t) coordinates, i.e. so that 


d/i := ^ dpAdpBdt, 

then the following holds for all measurable A C R^: 

(147) 

p{A) := dp= ^dpAdpsdt = / $ • det j7F“^dpi,idpi,2dp2,i = / 4>d9. 

Ja Ja Jf(A) Jf(A) 


In (147), we have, as above, used the notation dg for the Lebesgue measure on R'^ 
with respect to the coordinates Pij. 
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Now we will define $ (thus defining fi) as a product, 

(148) ^{pA,PB,t) := (pNiPA)PN{PB)i’N,'yit), 

where each of the functions ipN, is a Gaussian pdf on R with total integral 1 
and scale depending on N. In the case of both the scale and the center (i.e. 
location of the mode) of the Gaussian depend on N and on 7 . 

The main reason for adopting such a simple formula for ^(j)A,PB,t) is expedi¬ 
ency, but we also have to justify the choice of $ in (148) by showing that it really 
leads to a bona fide probability measure p, on R^. Fortunately, this easy easy to 
accomplish using (147) and Fubini’s Theorem, and we obtain the following result. 


Lemma 32. With $ defined by (148) and the pN o,nd Gaussian pdf’s on 
R with total integral 1, we have 


/ <^>{F i(pi.i,Pi. 2 ,P 2 ,i)) dg = 1, 


where dg is the Lebesgue measure on R^ with respect to the coordinates (pi,i,_Pi, 2 ,P 2 ,i 
That is, the measure $ o F’^^dg is a probability measure on R^. 


As a result we are justified in using p with density function (148) as the pa¬ 
rameter pdf in the function MonteCarloIntegrate procedure listed above. 

3.4. Sampling the Marginals. In constructing ip^, according to the general 
principles of variance reduction in Monte Carlo integration, we wish to choose pN 
so that the following proportionality relationship holds: 


(149) 


PN{PA)oe. / ii o F] ■ {1 a^ o F) {pA,PBfi) <^PB<^t. 
jRxR ^ ' 


Recall that /(g) approximates the probability of emission of g by p'’^ Numerical 
experimentation shows that, to first approximation, the variation in the magnitude 
of (149) is primarily due to the variation in the factor exp(—Afi/(g||p’*)) inside I, as 
opposed to the factor 1a.,- Therefore, we replace ( 149|) with the simpler condition 


(150) 


Tn{pa) / [I o Fj {pA,PB,t) <ipBdt. 
jRxR ^ ' 


By standard information theory the right-hand side of (150), as a function of pa, 
closely correllates with 

{p{ujn)a = pa}- 

By the central limit theorem, the most likely values of p{uji<i)a are concentrated 
around p\ = \- Thus, we should choose the mode of (pjv for all N to be located at 

Now that we have determined the mode of the Gaussian it remains to 
determine how to scale pN- Recall that Chernoff’s inequality implies that 


Pr, 




p{ujn)a - ^ 


> n < e 




We derive a heuristic from this inequality which says that pn should be defined so 
that there is a “suitable” sequence (to be specified precisely below) so that 


(151) 


' {x I |a:-l/2|>tjv} 


(pN{x)dx = e 


-Nti, 
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Since (pN is a pdf on R, we may write (1511 in the more convenient form 

rh-\-iN 


(152) 


ipN{x) dx = 1 — e 




Since (1511 and (152) can hold true for one sequence of {tAr}, we will choose {tAr} so 


that the right-hand side of (1511 and (152) are both fixed at a certain (convenient) 


constant value. Namely, we will choose a constant CentralProbability in (0,1), 
and set 


(153) 


In — 


— log(l — CentralProbability) 


N 


so that (152) becomes 
(154) 


1 


' h-tN 


(Pn{x) dx = CentralProbability. 


The most convenient value to take for CentralProbability is 
CentralProbability := 0.6826894921, 

the proportion of the mass lying within one standard deviation of the mean of a 


Gaussian, because then (154) simply says that In is the standard deviation (scale) 
of the distribution with pdf (p^. 

In our source module informationTheory, we implement the above calcula¬ 
tion of (/?Ar by defining a function, called ChernoflfRadius. Then t, the scale of 
ifN, is computed with a call to the function as follows: 


1 t = src. informationTheory. ChernoffRadius( CentralProbability, N) 


where CentralProbability is defined by (3.4). 


3.5. Sampling from a family of fixed marginals. For the description of 
' 0 Ar, 7 , we focus on the most important case: that is when when 'j < rj and the 
marginals match the marginals of In this case, that means the case when the 
marginals {pa,Pb) = (^jj) = ( 515 ) ■= Po- We define the following auxiliary 
notations: 

y = inax{r(po(t)) < 7}, 

= min{r(po(i)) < 7}, 

and 

(155) e°=t+-t-. 

We therefore center the the Gaussian '4’N,'y at , the maximum of the integrand 
within A°. 

The scale of is y — tAr, where is a sequence of points belonging to 
whose construction we now explain as follows: restrict the integrand I to 

the set 

\{PA,PB,t) eP I {pa,Pb) = fb’b) ’ K’^7] j- 
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The integrand can be written as a function of one variable t G [t^ ,t^], parameter¬ 
ized by iV G N, 

/ 7 v(f) := I{po{t)) = exp{-NH{po{t)\\p'^)) | J|(poW)*j 

V i,j 

It is not difficult to see that the family {/Arj^gisj of functions from to the 

reals has the following properties: 

Each /at is defined on the same interval [t ~, tiji] ■ 

For each iV G N, > 0. 

For all but finitely (small) N G N, is increasing as a function of t on 
[t-,t+]. 

Let p be a fixed ratio p < 1. Then by the preceding two properties, for all 
but finitely many TV G N, 

Either p • fnit^) < fNit~) or there exists a unique 

tN G such that /at( iiv)//«(!:]:) = P- 

Following the last property, for all but finitely many N, we can define tpf as the 
unique value in such that /iv(lAf)//N(iiJi) = P, if such a exists, and 

otherwise set = t~. For the remaining finitely many N (those for which is not 
increasing on , tiji] ), we can define to be the supremum of those G [t~ ,t'^] 
for which fN{tN)/fN{t'^) = P, if such a exists, and otherwise set In = t~. 

Therefore, in order to specify completely, it remains to define the fixed ratio 
p. Let Z(z) be the pdf of the standard Gaussian random variable, where z measured 
is in units of a, and set 

p:=Z(l)/Z(0)« 0.60653... 

It follows from the definition of In in terms of p in the previous paragraph that, 
for all but finitely many N, '4’N,'y has the properties that 

argmaxrj-j+,i/>Ar, 7 (t) = argmax/7v(t) = and ^ = p. 

Thus, ipN,'y realizes a Gaussian that approximates the shape of the integrand I on 
the path p°{t), t G of uniform marginals. 

This explains the choice of i/’ai ,7 on the fiber of uniform marginals. For the 
other fibers {p{t)} (with p G Vo), we again pick the Gaussian ipN,^ to have its 
maximum at the point po (t+) at which the path {p{t)} meets the boundary of A^, 
because at least on the fibers close to the fiber containing p°, p{t)^) is still the 
maximum point for the integrand. It is too time consuming though to recompute 
separately for each fiber. Although it does not make sense to reuse from 
the uniform-marginals path we can extract a statistic from In which we call the 
scaleRatio and use that instead. Namely, the scaleRatio is defined as the ratio 
of the scale from the previous calculation to the length of the entire path: 

, , ~ tN 

scaleRatio(fV) := ^ -. 

For the general path of fixed, but not necessarily uniform, marginals, we define 
in an analogous manner, as the length of the segment of the path contained in 


(156) 
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A^. We then set to be the Gaussian centered at and with scale equal to 
scaleRatio • (where both and are calculated with reference to the path in 
question, not the p°-based path). 

4. Interpolation 

In this section, we address the following problem: on the one hand, while any 
table of precomputed values must be finite, indeed must fit in RAM 

to be practical. On the other hand, in the implementation of the algorithm to 
learn structures from data, we will inevitably have to evaluate (3^^ at a continuum 
of 7 values and for arbitrary integer values N. This reality means that to put the 
algorithm into practice, we have to carry out some sort of interpolation. For reasons 
of speed, linear interpolation is the most feasible method, but a few moments 
thought will suffice to show that the dependence of /3^’' on 7 , N cannot be close 
to linear. Fortunately, through a combination of heuristic reasoning inspired by 
the Theory of Large Deviations and exploratory data analysis, we can discover the 
following: the dependence of log /3 on N is roughly linear. Likewise, the relationship 
of log/3 to a certain fairly simple statistic (KL(p'>'||p’*)) derived from 7 , rj is roughly 
linear. The emphasis in these statements is on the range of values that we most care 
about, namely “moderately” small 7 and moderately large N. Our tasks in this 
Section are as follows: first explain our choice of the statistics; second, demonstrate 
the claimed approximate linear dependence of log /3^ on these statistics; and, third, 
explain how the choice of interpolating statistics affects the choices (3^ ( 7 ) we choose 
to compute and store in the table. 

4.1. Statistics for linear interpolation. The relevant statistic for linear 
interpolation of log/3 in 7 is the KL-divergence ¥Aj{p^\\p^) = (Recall 

that p'^, like p^, by definition has uniform marginals.) The reason for adopting 
this statistic as the basis for the linear interpolation is ultimately empirical, as will 
be illustrated below. However, we also give an intuitive reason for why it would 
make sense to expect a linear relationship between log/3 and KL{p'^\\p'^). Namely, 
Conjecture says that in the cases that are of greatest interest to us p'*' is the 
I-projection of p'^ onto A°. One form of Sanov’s Theorem, applied to the closed set 
A°, yields 

^ log/^N ( 7 ) = -KL(p'^||p''). 

In fact, fixing several (reasonably small) values of N, namely N = 90, N = 
200, N = 900, N = 9000 and graphing /3 versus KL(p^||p’'), we clearly see the 
linear relationship emerging, especially as N grows (Figure]^. 

Note that, because we are only presenting data for the case j < rj, which 
is the case which matters most for the structure learning algorithm, a larger KL- 
divergence on the x-axis means a smaller 7 . The reason that, in the data we present, 
the maximum of KL(p'''||p’’) grows from 0.007 to 0.01 as N grows will be explained 
in Section [T^ below. 

The linear relationship between N and log /3^ is empirically demonstrated if 
we fix 7 and graph the values of log [3^ for fixed 7 and N varying in the range from 
N = 5 to N = 9000 (Figurej^. It is clear that (for the values of 7 we care about, a 
sample of which are shown here), the linear relationship is sufficiently strong once 
N reaches the thousands range. When N is in the range of hundreds, as shown in 
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Logarithmically scaled plots of/3 versus the statistic KL(p^ \p ’^) 




0.000 0.002 0.004 0.006 0.008 0.010 


A^=9000 



0.000 0.002 0.004 0.006 0.008 0.010 0.012 


KL(p^ Ip’' ) 


KL(p^ Ip’' ) 


Figure 5. Approximate linear exponential decay of f3 in the 
ArL(p'’'|jp''), T] = 0 . 01 , as 7 varies, in the top row, from approx¬ 
imately 1 X 10“^ and in the bottom row, from approximately 0 up 
to to p = 0 . 01 . 


Figure the linear relationship between log /3^ and N is already strong enough 
except in the case of the largest 7 considered. Because the sparsity boost in the 
algorithm makes the least difference when 7 is large (equivalently, 7 is close to 
p, equivalently, the KL-divergence considered is small), we find it safe to adopt 
the linear interpolation method in this range of N. Finally, in the case of N in 
the range less than 200, see Figure we see that the linear relationship breaks 
down, and indeed any definite trend in /3^ is difficult to discern. (The estimates 
of also become more noisy for N in this range.) In this range we can exactly 
calculate /3^ , so one approach would be to simply use the exact values for each 
N. Doing this ends up adding unnecessary complexity to the algorithm, so we keep 
using estimated values and linear interpolation. We compensate by decreasing the 
interval between N whose log /3^ 7 we are recording in the table from N = 1000 in 
the largest ranges, to = 100 in the intermediate ranges, and finally to IV = 10 or 
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Figure 6 . Same data as in Figure but plotted with respect to 
the 7 instead of the statistic KL(j>^\\p^). This illustrates why we 
interpolate with respect to the statistic KL{p'^\\p'^) instead of 7 : 
the growth of f3(pf) with respect to 7 itself is not linear exponential. 


iV = 5 in the smallest range N < 100. With these choices of N intervals, the linear 
interpolation procedure yields a value for log 0^ ( 7 ) which is close in magnitude to 
the ones for very nearby N actually recorded in the table. Note that for the smaller 
values of 7 considered in the ranges of smaller N^ we haven’t recorded enough data 
to graph. This is reflected in Figure]^ by the increasing value of N at the left edge 
of the graph as we move from upper left to lower right. The reason for this will be 
explained in Section 4.2 below. 


4.2. Building the table. We now come to the final topic, namely choosing 
the two-dimensional array of (TV, 7 )-pairs, for which we actually compute and tab¬ 
ulate / 3 ^( 7 ). We have saved this topic for last because the decisions we make in 
building the table are informed by all the previous considerations coming both from 
the structure-learning algorithm itself and from the empirical results of Section [4.1| 
on the numerical properties of log/3^ ( 7 ). The three main topics to be covered are 
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Logarithmic piots of (3 Versus N for fixed KL(p^ |p'') 





KL(p^ \p^ ) =7.80 xlO^^ 



N 


KL(p^ \p^ ) =8.56 xlO^^ 



N 


Figure 7. Approximate linear exponential decay of /3 as a func¬ 
tion of N, in thousands. Values of 7 in first row, from left to right 
are 4.67 x 10“^, 8.61 x 10“"*, in the second row, from left to right 
are 1.43 x 10“^ and 6.09 x 10“®. Plot is shown for 77 = 0.01. 


as follows: first, the choice of V, second the choice of 7 for each N, and third the 
choice of 70 (iV), which is the smallest meaningful test statistic, as function of N. 
We will explain below the precise meaning in this context of “smallest meaningful”. 

Choice of N for the table. The principal determining factor is that, as 
illustrated in Figures through]^ the linear dependence of log/S at on N becomes 
more reliable for large N, allowing us to make the interval AN between N in the 
table much larger as N grows. In practice, it seems to make sense, in our experience, 
to adopt a scheme such as 

• A7V = 5 for V < 100, 

• AiV = 10 for < 100 < V < 200, 

• AA = 50 for < 200 < A < 500, 

• AN = 100 for 500 < A < 1000, 

• AN = 1000 for 1000 < A < 10000, 
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Logarithmic piots of /3 Versus N for fixed KL(p^ \p’^ ) 






Figure 8. Approximate linear exponential decay of /3 as a func¬ 
tion of N in hundreds. Results shown for same values of 7, 77 as in 
Figure 


which is what we have adopted for the tables we built for our experiments. 

Choice of 7 in the table for each N. For the case of 7 < 77, we refer back 
to Figure ^ which illustrates that log (7) has a roughly linear dependence on 
KL{p^\p^j7 It will be convenient to re-parametrize the path-segment from to 
using the unit interval. In order the accomplish this we first parametrize the 
t-interval from t = 0 to (recall that <+ denotes the unique positive real value 
satisfying the condition T{p^{t^)) = 77 where p^ is the uniform distribution) using 
the statistic 


KL(/(t)|b^) 

KL(pO||p'') 


^ KL(/||p^) 

KL(p0||p’7) 


1, C(f+) 


KL(p^IIp^) 

KL{p^\\p'^) 


Note that 
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Logarithmic piots of (5 Versus N for fixed KL(p^ Ip”*) 






Figure 9. Values of /? for small N in intervals of 5 or 10. Results 
shown for same values of 7 , p as in Figure 


As a result, ^ is a smooth decreasing function which parameterizes the unit interval 
with in reverse order. Thus, the functional inverse parameterizes [ 0 ,f+] 

with the unit interval in reverse direction: 

C"^ : [0,1] [0,t+], withC"^(0) = f+ andC"^(l) =0. 

So if we compose with we obtain that p^{-) is a parameterization of 
the path-segment from p^ = p^{t^) to p° = p*^(0) with the unit interval, satisfying 

p°(-)oC-' : [0,1] ^ [p\p% withpO(C-i(0)) =p’' andp°(C-i(l)) =p°. 

Once we have parameterized the path from p'* to p® by the unit interval, the best 
way of thinking of our choice of 7 for the table is to think of our marking off a 
certain set of points on the unit interval, which we call the the “normalized KL- 
divergence list”. Since we care more about getting accurate values for ^(7) for 7 
close to 0, that is for points of the unit interval close to 1, the basic idea is to mark 
off the points of the “normalized KL-divergence list” so that they become more 
concentrated towards 1 . Many schemes could produce this effect, but the scheme 
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we have chosen is as follows. We create an object of type betaTable, and first 
invoke the method generate_N_List to generate the list of N values and store it 
as a data member. When we invoke on this same betaTable object the method 
generateNormalizedKL divergenceList with signature 

1 ge n e r at e N o r m a 1 iz e d K L _d i V e r g e n c e L i s t ( s e 1 f , stepsize , levelRatio , numLevels) 


The method produces a list of rational numbers separated by 

• stepsize units from 0 to 0.5, 

• stepsize/levelRatio units from 0.5 to 0 . 75 , 

• stepsize/levelRatio^, from 0.75 to 0.875 and so on numLevels times, 
so that the “ticks” closest to 1 are separated by only 

• stepsize/levelRatio"““’'^®'"®*® units. 

To clarify, here is an example of a test client invoking the method generateNor- 
malizedKL.divergenceList on an object of type betaTable class, with the out¬ 
put below the horizontal rule: 


1 import src . betaTable as bt 
class prettyfloat { float ) : 

3 def __repr__ { self ) : 

return ”%0.4f” % self 

5 eta=0.01 

newBetaTable = bt . bet aTab 1 e ( et a ) 

7 newBetaTable. g e n e r at e N o r m ali z ed K L _d i ver g en c e L i st (0.1 , 2,4) 

print map ( p r e 11 y f 1 o a t , newBetaTable. normalizedKL.divergenceList) 
9 print len (newBetaTable. nor malizedKL.divergenceList) 


11 

[0.0000 , 

0.1000, 0.2000, 

0.3000 , 

0.4000 , 

0.5000 , 


0.5500 , 

0.6000, 0.6500, 

0.7000 , 

0.7500 , 


13 

0.7750 , 

0.8000, 0.8250, 

0.8500 , 

0.8750 , 

0.8875, 0.9000 


0.9125 , 

0.9250, 0.9375, 

0.9500 , 

0.9625 , 

0.9750, 0.9875 

15 

25 






Because the interpolation of log / 3 ^ is actually performed on the statistic KL | ip’'), 
not 7, in practice the table is stored in memory as two separate tables of log/ 3 ’s 
associated to pairs (fV,KL(p'>'||p’')). The hrst table contains values for 7 < p, and 
the second table contains values 7 > p. In practice, as soon as a 7 = T{p(u!N,i, j)) 
is encountered in the evaluation of the objective function, 7 is converted to a KL- 
divergence and sent to the appropriate table for the interpolation. 

We have already explained how the 7 for 7 < p are selected, and it remains 
to explain how the 7 for 7 > p are chosen for tabulation. For the case of 7 > p, 
accuracy is far less crucial because in this case, the sparsity boost in the objective 
function is small. Therefore, we simply pick 10 evenly spaced points (on the 7 > p 
side of the path) between 0 and KL (p° (j) |p’') at which to evaluate and store the 
values of ( 3 . 

Finally, we discuss how to set the “threshold” value 70 (fV). In the course 
of the algorithm, 70 (fV) plays the following role. If we calculate a test statistic 
7 = T(p{uJiq),i,j) < 7o(iV), we will replace 7 with 7o(./V) before computing / 3 . In 
other words, in practice, the sparsity boost for 7 := t(w,4, R|s) is 

-log/S^" (max(7,7o(iV))). 

It is easiest to explain our reason for introducing such a 70 (fV) in the case when TV, 
the number of data points, is very small. For example, when TV = 4 , there are only 
35 contingency tables that the data could give rise to, and 25 of them correspond 
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to distributions in Vo- Thus, without such a 70( 4 ), there is a very high chance of 
the data sampled from any distribution being assigned the largest possible sparsity 
boost, namely — log ( 0 ). We have to find the smallest 70 such that the difference 
between /Sat (70) and I 3 n{.i') for 7' < 70 is meaningful, that is, is not a matter of 
arithmetic coincidence. In order to find a reasonable value for 70, we return to the 
definition of (Sat (7), 

Pn ( 7 ) := XI • ^a°{pt)- 

TgTn 

The main contribution to the sum comes from T such that pt belongs to a small 
number of paths (given by fixing the marginals) of probability distributions, namely 
those paths passing through the submanifold Vo at points close to the uniform 
distribution. The reason that only a few paths make a significant contribution to 
the probability is that the probability factor drops off sharply as a function of the 
marginals, for T such that the marginals of px are far from uniform. The locations 
of the points in the lattice pp^ have nothing to do with 7: they are determined 
by arithmetic. Thus, in the case that the length of the segment, l-y, is small in 
comparison to the spacing of the points of the lattice, whether a path contains 
any points of pp^ , and therefore makes a contribution to the sum is a matter of 
arithmetic coincidence. Therefore, we adopt a heuristic to avoid this circumstance, 
and choose 70 so that 


( 157 ) 

Here 


7o satisfies = 1 /N. 


is as defined in ( 1551 . 


Let 7o be defined by ( 157 ). Then 70 is the threshold such that for all 7 > 70 
we expect to have at least one point of pp^ fall on the segment of p°(t) between 
t~ and that is the portion of that path within the set A-^o. The point of the 
foregoing discussion is that if we obtain, empirically, a 7 value which is smaller than 
7o = 7o(-^)) then the empirical sparsity boost — ln/ 3 Ar( 7 ) will be “huge” simply 
because the path of distributions with uniform marginals contains no element of 
PPn ■ prevent large sparsity boosts from showing up by chance for small N by 
putting a lower threshold of 70 on the empirical mutual information. 

We implement the computation of 7o(fV), defined by ( 157 ), with the method 
generate N toMinimumGammaDict of the class betaTable. The results are 
plotted in Figure 10 to show the clear power law dependence of 70 (A^) on N. 

Bayesian-minded readers will doubtless observe that the problem that thresh¬ 
olding by 'yo{N) solves is an artifact of our making a prior distribution concentrated 
entirely at the point p^. An alternative to compensating for the choice of p'^ in this 
way would be to make the marginals of p^ “responsive” to the marginals of the 
empirical distribution, in particular for p"^ to be a distribution over distributions, 
instead of a point in V. We discuss this and other directions for future research in 
Chapter 

To summarize, the algorithm for the interpolation (or extrapolation) is as fol¬ 
lows: 


1 def float table . interpolateAndReturnLogBeta(N, gamma) : 

Stable : data structure {(N, gamma , logBeta) } 

3 if N < smallest tabulated N: 

return 0 

5 if N > largest tabulated N: 

return extrapolateAndReturnLogBeta(N, gamma ) 

7 :7)tgamma_0 (N) smallest tabulated for N ’near ’ given N 

if gamma < gamma_0(N) : 

gamma = gamma_0(N) 


9 
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Figure 10 . Power law dependence of jo{N) on N 
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return table . Linearinterpolate (N, gamma) 

def float table. extrapolatedValue(N, gamma) : 

Extract LargestN and SecondLargestN from table 
valLargest = interpolateAndReturnLogBeta( LargestN , gamma) 
valSecondLargest = interpolateAndReturnLogBeta(SecondLargestN , gamma) 
slope = (valLargestN — valSecondLargest)/ ( LargestN — SecondLargestN) 
return valLargest + slope*( N— largestN ) 


In practice, for the linear interpolation itself, we do not implement the function ta¬ 
ble.LinearInterpolate(N,gamma) but instead call scipy.interpolate.griddata 
with the default method, “linear”. Internally, this “tesselates the input point set 
to 2 -dimensional simplices, using Qhull, and interpolates linearly on each simplex”, 
which allows the interpolation to take account of all the neighboring points of the 
grid in a mathematically rigorous manner. 






CHAPTER 4 


Discussion 


We close by giving some further discussion of the relation of our results to those 
found in the literature, and outlining some directions for future research. 

1. Relation to previous literature 

Readers who are familiar with the results of [FY96j will note that we have 
an exponent of 2 on ^ (corresponding to their quantity e), where |FY96| have an 
exponent of |. We believe it may be possible to improve our exponent to their value 
through further analysis. Our exponent on C matches that obtained by |Hof93| . 
since we have used the methods of [Hoe65j in estimating the log-likelihood terms. 
Second, our exponent on m is 4 , where [FY96] have 2 . In our case, the higher 
exponent results from the analysis of the sparsity boost terms. There is a trade¬ 
off in terms of finite sample complexity in that, although using the sparsity boost 
terms gives worse finite sample complexity in terms of to, using the sparsity boost 
does allow the objective function to rule out all false structures that have even one 
false edge in the skeleton. The sample complexity results of |FY96| . for the usual 
BIC score, without the sparsity boost, make no such claim, ruling out only false 
structures in Finally, we point out that because we have used the methods of 
[Hof93| . we are, unlike |FY96| . obtaining a result that depends polynomially on 
n = |I/|, although unlike jFY96| . we are restricting ourselves to an constant bound 
d for the in-degree of all graphs to be considered. 

The approaches in the literature which are most closely related to our approach 
are the “hybrid” approaches, which use both constraint- and score-based approaches 
and attempt, through heuristic methods, to get the best of both worlds. Since the 
constraint-based approaches usually rely on independence tests, the resemblance to 
our method, at least at a philosophical level, is clear. The leading representatives 
of this approach are Fast’s “Greedy Relaxation Algorithm” (Relax) described in 
[Fas 10] and Tsamardinos et al.’s Max-Min Hill-Climbing (MMHC) described in 
jTBAOOj . The Relax algorithm starts by running constraint identification to 
learn constraints followed by edge orientation to produce an initial model. After 
the first model has been identified. Relax uses a local greedy search over possible 
relaxations of the constraints, at each step choosing the single constraint which, if 
relaxed, leads to the largest improvement in the score. The MMHC algorithm, in 
contrast, runs a constraint identification algorithm to learn an undirected skeleton 
only. Then MMHC builds up the final, directed network starting from the empty 
network, and then adding, deleting, and reversing edges, guided by a greedy opti¬ 
mization of the scoring metric. Note that both these approaches, though hybrid 
in a sense, still separate out the constraint- and score-based approaches into two 
distinct steps, rather than incorporating the constraint (conditional independence 
tests) as a term directly into the score itself. An a priori advantage of our approach 
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is that it removes the needs for heuristics from these hybrid approaches, and lends 
itself more to self-contained finite-sample complexity results such as the ones we 
have derived in this paper. This advantage we believe could be valuable in and of 
itself in making the subject more accessible to practitioners outside the specialty 
of Bayesian networks. Implicit in the experiments envisioned in the “Directions for 
future research” section below is to compare the experimental results obtained with 
our method to the published results of Relax and MMHC algorithms. 

2. Directions for future research 

As is well known, there is a hnite sample complexity result for the BIG score 
|FY96| . Intuitively we expect that under our scoring function, as compared to 
the BIG score, the true network will win out more quickly and by a larger margin 
over its competitors, because while the complexity penalty scales the same for 
every parameter in the model, the sparsity boost for different nonexistent edges 
differs. The boost increases to infinity precisely for those nonexistent edges for 
which accumulating evidence from the hypothesis tests points to independence. 
This will become important in the second step of the algorithm, the optimization. 

In order to optimize our score, virtually any optimization procedure can be 
used, but we propose using linear programming methods, similar to j.TSQTvn ni 
ICusllj . In preliminary experiments, |RS13| . we have found that as the amount of 
data increases LP relaxations become tight, thus providing the optimal solution to 
this maximization problem, with little computational overhead. This also suggests 
that local search-based methods would also be able to find the best-scoring Bayesian 
network in the large-sample limit. 

A key issue in the optimization is finding a good way of pruning a set of 
possible parents of each node. Even under the typical assumption that the size of 
all parent sets is bounded by a constant d, there are still far too many candidate 
parent sets for each variable. For the BIG score, there are criteria saying when one 
may exclude from consideration certain possible parent sets of a variable, and also, 
more signihcantly, all supersets of the excluded parent set (Section 2 of [dcz.Tnflj b 
The criterion significantly reduces the number of parent sets whose information 
(contribution to the score function) has to be cached, but we would still like to 
cut down on the number further. In this regard, the heuristic of |TBA06] for 
generating a list of candidate parent sets for each variable is of particular interest. 
We envision using a similar heuristic from the point of view of ordering the possible 
parent sets, but using our objective function instead of statistical independence 
tests. For more details, see our forthcoming paper |BS13| . 

Another issue to be explored as a future line of investigation is the choice of 
in within the objective function. The overall motivation for /S^’’ is to capture 
the probability of Type II error of a statistical test with independent distributions 
Vo as null hypothesis Hq and all distributions Vrj as alternative hypothesis Hi. The 
currently implemented choice of is a distribution with uniform marginals. This 
corresponds philosophically to replacing the more natural composite alternative 
hypothesis Hi = Vn with the more restrictive simple alternative hypothesis Hi = 
p^. Alternatively, fixing p^ this way, so that it is completely insensitive to the 
marginals of the empirical distribution p(wjv), is somewhat like adopting a strong 
Bayesian prior belief that the marginals of the underlying distribution are at or 
near uniform. It must be said that overall the choice of uniform marginals for p^ 
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represents an expedient choice, providing an objective function that is manageable 
to implement and compute, yet still has a reasonable sample complexity. The more 
philosophically grounded method of choosing Hi, taking account of every possible 
distribution in Vr^, would likely lead to an objective function that is prohibitively 
expensive to compute in practice. A compromise approach, which might yield the 
best results in the long run, is to use a simple alternative whose marginals 
can vary, approximating without necessarily matching those of p{(jjjq). This is one 
concrete possibility for future investigations to explore. More generally, one can 
contemplate incorporating various sorts of statistically derived probabilities into 
the objective function in order to arrive at a good sparsity boost, and it may very 
well turn out that some of them will be even more effective for certain purposes 
than the one considered here. 

This leads to the broader point that objective functions, and the optimization 
of them over discrete spaces of structures, are ubiquitous throughout probabilistic 
graphical models and fields of machine learning. The present investigations may 
suggest a new paradigm for incorporating information from “classical” hypothesis 
tests into the objective functions used in machine learning. These investigations, 
therefore, should be of broader interest to the machine learning community, far 
beyond researchers in the field of Bayesian networks. 



APPENDIX A 


Investigation of the /-projection 


In the first paragraph of Section [4.I[ we expressed our interest in characterizing 
more precisely the /-projection of onto A° (at least for the case k = I = 2) 
and promised to discuss this issue further in the appendices. While, according to 
general results, for example Theorem 8 . 6 , p. 277, in IKF09 |. the distribution p^ 
is the unique M-projection of p^ onto Ag, there do exist rj such that p"^ has two 
/-projections onto with equal KL-divergence from p^ (for example the two T 
projections of p^ with 77 = 0.2 are two product distributions with equal marginals 
(0.25, 0.75) and (0.75,0.25), respectively). This illustrates the general principle that 
it is difficult to compute or characterize the /-projection of an arbitrary distribution. 
We do not strictly need the results of this appendices for any of the proofs of 
the main text, but a better understanding of the /-projection would considerably 
simplify the discussion of Chaterj^by allowing us to conclude that the marginals of 
q'^ are always (at least in many cases of interest) close to the marginals of p^ and 
therefore entirely dispense with one branch of the “dichotomy” discussed in that 
section. This would lead to better constants in the finite sample complexity result 
and a simpler characterization of 

This serves to justify our interest in the following conjecture, which character¬ 
izes the /-projection for those p^ such that 77 is sufficiently close to 0 . 

Conjecture I. For all p"^ with <+ < ~ 0.115529289315002 (which 

is to say all 77 less than approximately 0.11094054602671935j we have that the /- 
projection ofp^ onto is the distribution p^ with uniform marginals on the bound¬ 
ary component of AJ closer to p^. 

According to the conjecture, we may assume without loss of generality {e.g., 
by decreasing rj to 0.1109 if necessary) that 

(158) q^ = p^ 

Proposition 23. Conjecture\^is true for the special case 0/7 = 0 . 

Proof. In that special case we are considering the /-projection of the distri¬ 
bution p'^ onto the space of product distributions Vq. By symmetry of p'^ we can 
see that any /-projection of p'^ onto Vo is a product distribution poix) of two mar¬ 
ginal distributions with identical parameters (x, 1 — x) with x € [0,1]. Further if 
one of the /-projections is the product distribution 770 ( 2 ;) of {x, 1 — x) with itself, 
then the the other /-projection is the product distribution 779(1 — x) of (1 — x,x) 
with itself (including the “degenerate” case of a: = ^ when these two product dis¬ 
tributions are coincidentally equal to one another ). As a result, we may assume 
without generality that x G [ 5 ,1]. Now, we set t = P* and write the KL-divergence 
H{po{x)\\p'^) in terms of the variables x,t by setting H{x,t) ■= ^^(Po(2;)||77’’). Let 
t be fixed (because t = tfj and 77 is fixed). The parameter x can correspond to 
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an /-projection p{x) of if and only if vanishes. We can compute the 

partial derivative explicitly and obtain 


i) = (X - 1) log - X log 

Make the change of variable 


(159) 


y ■= 


1 — X 

X 



\ 4 


The bounds new variables 


0<y<l, 0<z<l, 

say that the new variables lie in the unit square. Regrouping terms we obtain that 
the condition that the partial derivative vanishes is 


(160) 


log y + y log 2 — log z = 0. 


For each z in the unit interval equality (160) has two real solutions y, namely, 
(161) j/= Wo(zlogz)/(logz), andy = lF_i(zlogz)/(logz), 


where Wq and W-i are, principal real and unique non-principal real branches of 
the Lambert-W function respectively. We have to detemine which of these solutions 
yields a y in the unit interval. 

The function g{z) = zlogz maps the unit interval (0,1) onto the interval 
(0, —e“^) in a two-to-one fashion: namely, g{z) decreases on the interval (0, e“^) 
from the value 0 (which is how we define OlogO by convention) to the value —e~^. 
At that point, the graph of g{z) has a critical point (minimum) and thereafter 
increases on the interval (e“^, 1) from the value —e~^ to the value 0. 

The two points z, z' on opposite sides of the critical point e~^ which map to 
the same value g{z) = g{z') satisfy 


(162) 


log z z' 

log z' z 


Since g{z) = logz = (logz)e'°®^, the two compositions Wo{g{z)) and W-i(g{z)) 
must equal logz and logz'. Thus we obtain the two solutions to (160): 

log(z) 


y = 


log(z) 


= 1 , 


y = 


log(^') 

log(z) 


where the latter equality follows from (162). 


In order for j/ < 1 to be satisfied, 
we must have z < z'. By the above description of g{z), z must lie in the interval 
[0,e“^]. According to the change ofvariable (159) the point z = e~^ corresponds 
uniquely to 

1 — e~^ 

to = -TV « 0.115529289315002. 

4(1 -I- e-'-) 


The trivial solution y = 1 corresponds to the critical point at a: = and this 
is a critical point for every choice of t. In addition, we have shown that there is one 
additional critical point x, located in (i, 1), if and only if t > to ~ 0.115529289 ..., 
so by symmetry there is also a critical point at 1 — x. Since for t < to, the critical 
point at a: = i is the only critical point, it is easy to see that this unique critical 
point is a minimum, so the /-projection is the uniform distribution as claimed. □ 
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Remark 2. In the above proof, it is possible to verify that the non-trivial solu¬ 
tion comes from the principal branch Wq of the Lambert- W function in ( |161[ ), while 
the non-trivial non-solution, that is the y > 1 values in the z-interval [e~'-, 1 ] comes 
from the non-principal branch W-i. 



Figure 1. Dependence on 7 of I-projection of p^ onto Ag 

We conclude by giving some empirical evidence for the truth of Conjecture 
In Figure the marginal pA = Pb of the I-projection of p^ onto (i.e., the 
marginal called x in the above proof) corresponds to the minimum point of the 
KL-divergence curve. We have considered the case of 77 = 0.4, which is very nearly 
the case considered in Example 8.14, p. 275 of |KF09| . In that example, they 
consider only the case 7 = 0 and state only that there are two distinct minima of the 
function ¥Ai(p{x),p^). Here we can see from the shape of the curve the qualitative 
reason for this behavior: the KL-divergence of p^ from p^ always decreases very 
quickly as pA moves towards the central value of 0.5 away from the wall, but when 
p^ is sufficiently far from the uniform distribution (i, i), the KL divergence levels 
off whenp^i moves close enough to 1 / 2 , that is to say whenp'’' moves close enough to 
the uniform. When in this case 7 is sufficiently small, this behavior is pronounced 
enough to lead to an increase in KL{p{x),p^) after x gets close enough to 0.5. Thus, 
for example, in the cases of 7 = 0.0 and 7 = 0.0025, we can see a relative minimum 
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when X is roughly 0.15 and roughly 0.25, respectively. By the time 7 is 0.005, we no 
longer see this behavior occurring strongly enough to lead to a relative minimum. 

This then is the basis for believing that Conjecture [T is true for all smaller 
than the value stated in the Conjecture: the failure of tne statement in the con¬ 
jecture is associated with the existence of a relative minimum of the function 
KL(p(a;),p'') for some x € (0,0.5). The combination of empirical experiments 
(a sample of which is shown in Figure and heuristic reasoning shows that the 
relatve minimum has the “best chance” of appearing for 7 = 0, and Proposition!^ 
shows that no such relative minimum exists for 7 = 0 for in the stated range. 
Thus, we expect that no relative minima exist for any 7 > 0 for in the stated 
range. 



APPENDIX B 


Background on Entropy and Log Likelihood 


We now recall some well-known notions related to the log-likelihood term and 
estimates of this term. Experts in the field will probably find it possible to skip 
the entirety of this section with the exception of the characterization of the error- 
tolerance C as a KL-divergence at the very end of the section. We first recall the 
notation for conditional entropy. When y, Z are distinct subsets of the variable set 
A, and P is a (joint) probability distribution over A, we use 

(163) ilp{Z\y) := HpiZ,y) - Hp{y) 


where Hp{Z,y) and Hp{y) are the (ordinary) entropies of the probability distri¬ 
butions on the subsets Z uy and y, induced by P. 

There is a well-known decomposition of the LL(a;Ar,G), for any Bayesian net¬ 
work G, in terms of conditional entropy terms: 

n 

(164) LL(a;jv,G) = -Af ^ (W|PaG(A,)), 

i=l 


where Pa.G{Xi) is the parent set of node Xi in the DAG G. 

The expansion (164) allows us to define a more general version of the log- 
likelihood for any distribution B in place of the empirical distribution p{ujm)- 
Namely, we may use B to evaluate each of the conditional entropy terms, 


(165) LL^^(P,G) = -NY,HB{X,\Pe.G{X,)). 

i=l 


This expression is known as an idealized log-likelihood, because for many values of 
N it may not be possible to realize B as an empirical distribution p{ujn)- 

It is well-known that a probability distribution B is an /-map for G (that is 
to say, B satisfies the independence relationships expressed by the DAG G), if and 
only if B factors as follows: 


(166) B{X) =l[B,{X,\PaGX,) 

i=l 


for some collection of CPD’s Bi{Xi\PaGXi). It is well-known from the study of BN 
learning as optimization that LL^ujn^G) is the maximum of LL{ujntP) among all 
the distributions p factoring according to G. Following e.g. p. 5 of |FY96p we 
know that this maximum is achieved by p — Pg.ujn^ where 


PG,uj]^ (X) — tt Pi^jv,i(Xi|PaGa;i) 
i=l 

and where the parameters of the factor CPD’s are equal to the parameters induced 
by the appropriately marginalized frequency counts of 
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Because (165) extends the notion, (164), of log-likelihood of an empirical se¬ 
quence under a network G to idealized log-likelihood of any probability distribution 
B under G, we can extend this concept naturally and ask for probability distribu¬ 
tion PG,B which maximizes the “likelihood of B” under G subject to the condition 
that G is an I-map for Pg,b- 

The key to defining pg,b is to use the following relationship (see, e.g., p. 17 
of |FY960 between the log-likelihood of the data ojjq under an arbitrary network 
G G G and the KL-divergence of Pu,,^ based at pg,uin ■ 

LL(a;Ar,G) = NH{p{ujn)) - NH{p{ujn)\\pg,ljm)- 


Since the first term (involving entropy oi ujn) is fixed, and the second term has a 
minus sign, it is immediately clear that Pu>^^g can be characterized as the distribu¬ 
tion minimizing H{p{ujn)\\pg,uin) (among those which factorize according to G). 
More generally, the same computation yields, for any probability distribution over 

(167) G) = NH{B) - NH{B\\pg,b), 


where the distribution pg,b is defined by 


Pg,b{X) = '^pB{Xi\Pa.GXi), 
i=l 

and where the factor distributions in the product on the right-hand side are the con¬ 
ditional joint factors induced by the restriction and marginalization of B. That is 
to say, among all p which factor according to G, pg,b minimizes the KL-divergence 
H{B\pg,b) and also simultaneously is the distribution for which NH{B)—NH{B\\p) 
achieves its maximum value of LL^^(i?, G). 

For B any probability distribution and G any family of BN’s, and any C > 0 
we make the definitions 


• B is the set of Bayesian networks of the form (G, Pg,b) for all G G G- 

• is the subset of B consisting of those (G, Pg,b) for which Pg,b = B ot 
for which H{B\\Pg^b) > C- 

It is clear that if B (or P{ujn) for that matter) happens to factorize according to 
G, then Pg,b = B (resp. Puin,g = -P(wAr)). This explains why we have included 
the (G, Pg,b) for which Pg,b = B in the definition of B(^: we are mostly interested 
in applying the definition when B is the probability distribution for an unknown 
underlying Bayesian network from which a sequence ujn of samples is drawn. Then 
our finite-sample complexity theorem will give conditions under which the scoring 
function assigns its highest score to the true network {G,B), among all networks 
in B,^, where C is an error tolerance that is or may be under the control of the 
experimenter (learner). 

In order to understand the proofs of our main results, the reader has 
to keep in mind only the following fact concerning B and B(; 

The complement B\B(^ consists of those BN’s {G,P) such that 
P[{B\\P) < ( and P is the M-projection of B onto the set of net¬ 
works which are I-maps for B. 


Remark 3. The quantity not our e, corresponds to the e of [FY96] . Our e 
is more closely (though somewhat more general) than the minimium “information 
content” considered in |ZMD06| . 
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Proposition 24. Let e > 0, p € [0,1]. Denote by G (0,1) the empirical 
average of a sequence drawn from the Bernoulli distribution X(p). Assume that 
Pn > P-- Then we have 

Pr{\pNlogpN -p\ogp\ >e}< 5exp 



Proof. By elementary estimates, 

\p\ogp- pNlogpNl = Ip log p - Pat log p + PAT log p-pw log Pat I 
< Ip log p-pN logpl + \pN log p-pN logPAf I 


= Ip logpl 

1 _ ^ 

p 

+ PAr| logp 

- log PA 1 


< Ip logpl 

^ _ 1 

p 

+ \Pn - p\ 

, PN 

log- 

p 

+ p 

, PN 

log — 
P 


We now define each of the three events 


E, = . 

Ip logpl 

PA _ 1 



P 

3J 

E2 = < 

Ipa - p| 

, Pa 
log — 
P 

V 1 CO 

Al 


^^3 = i P 


, PN 
log — 

P 


e 

> - 
- 3 


and then use the union bound. 

By applying Lemma (stated below in Section directly to Ei , we obtain 

/ -Ne"^ 


PrEi < 2exp 


V27p|logp|2y ■ 
-1 


Since p S (0,1), we have, first that p ^ > 1 and second that 
minimum (in the unit interval) of ^ at e~^. Therefore, 

’ —Ne^e^' 


has its 


(168) 


Pri?i < 2exp 


4-27 


In order to estimate the probability of E 2 , we first prove the estimate 

2 

(169) \pN-p\ 

There are two cases: pat > p and p > pN. In case pN > P, 


, PA 

log — 

< P 

1 _ M 

P 


P 


Ipn - p| 
In case p > pat, 
\Pn - p\ 


, PN 
log- 

P 


PN 



1 _ M 

VP / 

p 


, PN 
log- 

P 


= {-{PN-P)) ( -log^ 


= {Pn - p) log 


Pn 


< P 


Pn 


- 1 =p 


1 - 


Pn 
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Because of (169), 

Pri ?2 < Pr 
so that by Lemma 
(170) 


1 _ m 


PrL ;2 < 2 exp 



^ PN 


> J - 

3p 


-Ne 

9 


In order to estimate Pi'-Es, we first use the assumption p > p to eliminate the 
absolute value signs around the logarithm, then use (|^: 


PN 


P 


3p 


PN 


PrEs = Pr log — > — ^ < Pr - 1 > — ^ = Pr <^ — > 1 + — 


P 


3p 


PN 


P 


3p 


So by Lemmaa), 

(171) Pri ?3 < exp 


-Ni^ 

I8p 


< exp 


-Ne^ 

18 


where again we have used the condition that p~^ > 1. Among the three terms on 
the right side of (168), (170), (171), the one with the smallest coefficient of —N 
inside the exponent is (171), so that, the other terms can be estimated by 2 times 
this term. We conclude from these three just cited estimates and the union bound 
that 

f -Ne^' 

PrjpAT logp — plogpl < 5exp 


18 


□ 
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